﻿WEBVTT

00:00:08.691 --> 00:00:15.429
- Hello, hi. So I want to get started.
Welcome to CS 231N Lecture 11.

00:00:15.430 --> 00:00:23.258
Today we're going to talk about detection, segmentation, and a whole bunch
of other really exciting topics around core computer vision tasks.

00:00:23.259 --> 00:00:25.590
But as usual, a couple
administrative notes.

00:00:25.590 --> 00:00:31.358
So last time you obviously took the midterm, we
didn't have lecture, hopefully that went okay

00:00:31.358 --> 00:00:42.269
for all of you. So we're going to work on grading the midterm this week, but as a reminder,
please don't discuss the midterm questions or answers publicly

00:00:42.270 --> 00:00:48.517
until at least tomorrow because there are still some people
taking makeup midterms today and throughout the rest of the week

00:00:48.518 --> 00:00:53.668
so we just ask that you refrain from
talking publicly about midterm questions.

00:00:56.329 --> 00:01:02.920
Why don't you wait until Monday?
[laughing] Okay, great.

00:01:02.921 --> 00:01:07.760
So we're also starting to work on midterm grading.
We'll get those back to you as soon as we can.

00:01:07.761 --> 00:01:14.078
We're also starting to work on grading assignment two so there's
a lot of grading being done this week. The TAs are pretty busy.

00:01:14.079 --> 00:01:18.479
Also a reminder for you guys, hopefully you've been
working hard on your projects now that most of you

00:01:18.479 --> 00:01:26.969
are done with the midterm. Your project milestones will be due on
Tuesday, so any last-minute changes that you had in your projects,

00:01:26.970 --> 00:01:31.650
I know some people decided to switch projects after
the proposal, some teams reshuffled a little bit,

00:01:31.650 --> 00:01:39.676
that's fine but your milestone should reflect the project that you're actually
doing for the rest of the quarter. So hopefully that's going well.

00:01:39.677 --> 00:01:43.900
I know there's been a lot of worry and stress
on Piazza, wondering about assignment three.

00:01:43.900 --> 00:01:50.188
So we're working on that as hard as we can but that's actually
a bit of a new assignment, it's changing a bit from last year

00:01:50.189 --> 00:01:53.951
so it will be out as soon as possible,
hopefully today or tomorrow.

00:01:53.951 --> 00:02:01.550
Although we promise that whenever it comes out you'll have two
weeks to finish it so try not to stress out about that too much.

00:02:01.551 --> 00:02:05.318
But I'm pretty excited, I think assignment
three will be really cool;

00:02:05.318 --> 00:02:09.079
it'll cover a lot of really cool material.

00:02:09.079 --> 00:02:13.340
So another thing, last time in lecture we
mentioned this thing called the Train Game

00:02:13.340 --> 00:02:17.780
which is this really cool thing we've been working
on sort of as a side project a little bit.

00:02:17.780 --> 00:02:24.391
So this is an interactive tool that you guys can
go on and use to explore a little bit the process

00:02:24.391 --> 00:02:27.340
of tuning hyperparameters
in practice.

00:02:27.340 --> 00:02:33.119
So this is again totally not required for the course.
Totally optional, but if you do, we will offer

00:02:33.119 --> 00:02:35.072
a small amount of extra
credit for those of you

00:02:35.072 --> 00:02:37.963
who want to do well and
participate in this.

00:02:37.963 --> 00:02:42.224
And we'll send out exactly some more
details later this afternoon on Piazza.

00:02:42.224 --> 00:02:48.362
But just a bit of a demo for what exactly is this thing.
So you'll get to go in and we've changed the name

00:02:48.362 --> 00:02:51.752
from Train Game to HyperQuest
because you're questing

00:02:51.752 --> 00:02:54.464
to find the best
hyperparameters for your model

00:02:54.464 --> 00:02:59.344
so this is really cool, it'll be an interactive tool that
you can use to explore the tuning of hyperparameters

00:02:59.344 --> 00:03:01.254
interactively in your browser.

00:03:01.254 --> 00:03:04.871
So you'll login with
your student ID and name.

00:03:04.871 --> 00:03:08.830
You'll fill out a little survey with some
of your experience on deep learning

00:03:08.830 --> 00:03:14.934
then you'll read some instructions. So in this
game you'll be shown some random data set

00:03:14.934 --> 00:03:16.152
on every trial.

00:03:16.152 --> 00:03:21.494
This data set might be images or it might be vectors
and your goal is to train a model by picking

00:03:21.494 --> 00:03:25.632
the right hyperparameters interactively to
perform as well as you can on the validation set

00:03:25.632 --> 00:03:28.077
of this random data set.

00:03:28.077 --> 00:03:31.382
And it'll sort of keep track of your performance
over time and there'll be a leaderboard,

00:03:31.382 --> 00:03:33.423
it'll be really cool.

00:03:33.423 --> 00:03:38.723
So every time you play the game, you'll
get some statistics about your data set.

00:03:38.723 --> 00:03:42.397
In this case we're doing a
classification problem with 10 classes.

00:03:43.424 --> 00:03:47.774
You can see down at the bottom you have these
statistics about the random data set, we have 10 classes.

00:03:47.774 --> 00:03:52.987
The input data size is three by 32 by 32 so
this is some image data set and we can see

00:03:52.987 --> 00:03:58.832
that in this case we have 8500 examples in the
training set and 1500 examples in the validation set.

00:03:58.832 --> 00:04:01.518
These are all random, they'll change
a little bit every time.

00:04:01.518 --> 00:04:06.912
Based on these data set statistics you'll make some choices
on your initial learning rate, your initial network size,

00:04:06.912 --> 00:04:08.931
and your initial dropout rate.

00:04:08.931 --> 00:04:13.811
Then you'll see a screen like this where it'll
run one epoch with those chosen hyperparameters,

00:04:13.811 --> 00:04:19.712
and on the right here you'll see two
plots. One is your training and validation loss

00:04:19.712 --> 00:04:21.040
for that first epoch.

00:04:21.040 --> 00:04:23.409
Then you'll see your training
and validation accuracy

00:04:23.409 --> 00:04:30.759
for that first epoch and based on the gaps that you see in these two graphs you
can make choices interactively to change the learning rates and hyperparameters

00:04:30.759 --> 00:04:32.290
for the next epoch.

00:04:32.290 --> 00:04:37.803
So then you can either choose to continue training
with the current or changed hyperparameters,

00:04:37.803 --> 00:04:41.523
you can also stop training, or you can
revert to the previous checkpoint

00:04:41.523 --> 00:04:43.872
in case things got really messed up.

00:04:43.872 --> 00:04:48.691
So then you'll get to make some choice,
so here we'll decide to continue training

00:04:48.691 --> 00:04:51.347
and in this case you could
go and set new learning rates

00:04:51.347 --> 00:04:54.971
and new hyperparameters for
the next epoch of training.

00:04:54.971 --> 00:04:59.808
You can also, kind of interesting here, you
can actually grow the network interactively

00:04:59.808 --> 00:05:01.899
during training in this demo.

00:05:01.899 --> 00:05:07.562
There's this cool trick from a couple recent
papers where you can either take existing layers

00:05:07.562 --> 00:05:12.083
and make them wider or add new layers to the network
in the middle of training while still maintaining

00:05:12.083 --> 00:05:15.762
the same function in the
network so you can do that

00:05:15.762 --> 00:05:20.131
to increase the size of your network in the
middle of training here which is kind of cool.

00:05:20.131 --> 00:05:24.430
So then you'll make choices over several epochs
and eventually your final validation accuracy

00:05:24.430 --> 00:05:26.811
will be recorded and we'll
have some leaderboard

00:05:26.811 --> 00:05:29.912
that compares your score on that data set

00:05:29.912 --> 00:05:33.072
to some simple baseline models.

00:05:33.072 --> 00:05:37.534
And depending on how well you do on this leaderboard
we'll again offer some small amounts of extra credit

00:05:37.534 --> 00:05:39.774
for those of you who
choose to participate.

00:05:39.774 --> 00:05:42.322
So this is again, totally
optional, but I think

00:05:42.322 --> 00:05:46.936
it can be a really cool learning experience for you guys
to play around with and explore how hyperparameters

00:05:46.936 --> 00:05:49.243
affect the learning process.

00:05:49.243 --> 00:05:54.872
Also, it's really useful for us. You'll help
science out by participating in this experiment.

00:05:54.872 --> 00:06:02.101
We're pretty interested in seeing how people behave when
they train neural networks so you'll be helping us out

00:06:02.101 --> 00:06:04.422
as well if you decide to play this.

00:06:04.422 --> 00:06:08.462
But again, totally optional, up to you.

00:06:08.462 --> 00:06:10.295
Any questions on that?

00:06:15.080 --> 00:06:18.680
Hopefully at some point.
So the question was will this be a paper

00:06:18.680 --> 00:06:20.272
or whatever eventually?

00:06:20.272 --> 00:06:26.760
Hopefully but it's really early stages of this
project so I can't make any promises but I hope so.

00:06:26.760 --> 00:06:29.510
But I think it'll be really cool.

00:06:33.240 --> 00:06:35.000
[laughing]

00:06:35.000 --> 00:06:37.971
Yeah, so the question is how can
you add layers during training?

00:06:37.971 --> 00:06:43.552
I don't really want to get into that right now but
the paper to read is Net2Net, and Ian Goodfellow is

00:06:43.552 --> 00:06:45.291
one of the authors and
there's another paper

00:06:45.291 --> 00:06:48.240
from Microsoft called Network Morphism.

00:06:48.240 --> 00:06:52.407
So if you read those two papers
you can see how this works.
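To make the widening trick a bit more concrete, here's a minimal NumPy sketch of the Net2Net-style idea in the simplest possible setting (a hypothetical two-layer ReLU net, not the paper's exact procedure): duplicate a few hidden units and split their outgoing weights in half, so the widened network computes exactly the same function.

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def widen_hidden(W1, b1, W2, grow, rng):
    """Net2Net-style widening sketch: duplicate `grow` distinct hidden
    units and split their outgoing weights so the function is preserved."""
    idx = rng.choice(W1.shape[1], size=grow, replace=False)  # units to copy
    W1_new = np.concatenate([W1, W1[:, idx]], axis=1)        # copy incoming weights
    b1_new = np.concatenate([b1, b1[idx]])
    W2_new = W2.copy()
    W2_new[idx, :] *= 0.5                                    # halve original outgoing weights
    W2_new = np.concatenate([W2_new, 0.5 * W2[idx, :]], axis=0)  # copies get the other half
    return W1_new, b1_new, W2_new

# Function-preservation check: outputs match before and after widening.
rng = np.random.default_rng(0)
W1, b1 = rng.standard_normal((4, 8)), rng.standard_normal(8)
W2 = rng.standard_normal((8, 3))
x = rng.standard_normal((5, 4))
y_before = relu(x @ W1 + b1) @ W2
W1w, b1w, W2w = widen_hidden(W1, b1, W2, grow=2, rng=rng)
y_after = relu(x @ W1w + b1w) @ W2w
print(np.allclose(y_before, y_after))
```

Because each duplicated unit's two copies each carry half of the original outgoing weight, their contributions sum to the original, which is what lets training continue from a wider network without a drop in accuracy.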

00:06:53.680 --> 00:06:58.152
Okay, so a bit of a reminder:
last time before the midterm we talked

00:06:58.152 --> 00:06:59.792
about recurrent neural networks.

00:06:59.792 --> 00:07:03.032
We saw that recurrent neural networks can
be used for different types of problems.

00:07:03.032 --> 00:07:07.192
In addition to one to one we can do one
to many, many to one, many to many.

00:07:07.192 --> 00:07:10.679
We saw how this can apply
to language modeling

00:07:10.679 --> 00:07:15.460
and we saw some cool examples of applying neural networks to
model different sorts of languages at the character level

00:07:15.460 --> 00:07:20.571
and we sampled artificial math,
Shakespeare, and C source code.

00:07:20.571 --> 00:07:26.560
We also saw how similar things could be applied to
image captioning by connecting a CNN feature extractor

00:07:26.560 --> 00:07:28.491
together with an RNN language model.

00:07:28.491 --> 00:07:31.011
And we saw some really
cool examples of that.

00:07:31.011 --> 00:07:36.040
We also talked about the different types of
RNNs. We talked about the Vanilla RNN.

00:07:36.040 --> 00:07:40.158
I also want to mention that this is sometimes
called a Simple RNN or an Elman RNN so you'll see

00:07:40.158 --> 00:07:42.331
all of these different
terms in the literature.

00:07:42.331 --> 00:07:44.997
We also talked about the Long
Short Term Memory or LSTM.

00:07:44.997 --> 00:07:50.102
And we talked about how
the LSTM has this crazy set of equations

00:07:50.102 --> 00:07:53.021
but it makes sense because it
helps improve gradient flow

00:07:53.021 --> 00:07:56.022
during back propagation
and helps this thing model

00:07:56.022 --> 00:07:59.443
longer-term dependencies
in our sequences.

00:07:59.443 --> 00:08:03.982
So today we're going to switch gears and talk
about a whole bunch of different exciting tasks.

00:08:03.982 --> 00:08:08.992
So far we've been talking
mostly about the image classification problem.

00:08:08.992 --> 00:08:13.262
Today we're going to talk about various types of other
computer vision tasks where you actually want to go in

00:08:13.262 --> 00:08:19.542
and say things about the spatial pixels inside your images
so we'll see segmentation, localization, detection,

00:08:19.542 --> 00:08:21.942
a couple other different
computer vision tasks

00:08:21.942 --> 00:08:25.494
and how you can approach these
with convolutional neural networks.

00:08:25.494 --> 00:08:29.552
So as a bit of refresher, so far the main
thing we've been talking about in this class

00:08:29.552 --> 00:08:32.163
is image classification so
here we're going to have

00:08:32.163 --> 00:08:34.842
some input image come in.
That input image will go through

00:08:34.842 --> 00:08:36.583
some deep convolutional network,

00:08:36.583 --> 00:08:42.991
that network will give us some feature vector of
maybe 4096 dimensions in the case of AlexNet or VGG

00:08:42.991 --> 00:08:46.222
and then from that final feature vector
we'll have

00:08:46.222 --> 00:08:47.750
some final fully-connected layer

00:08:47.750 --> 00:08:50.568
that gives us 1000 numbers
for the different class scores

00:08:50.568 --> 00:08:55.660
that we care about where 1000 is maybe the
number of classes in ImageNet in this example.

00:08:55.660 --> 00:08:59.080
And then at the end of the day
what the network does is we input an image

00:08:59.080 --> 00:09:01.437
and then we output a single category label

00:09:01.437 --> 00:09:05.083
saying what is the content of
this entire image as a whole.

00:09:05.083 --> 00:09:09.879
But this is maybe the most basic possible task
in computer vision and there's a whole bunch

00:09:09.879 --> 00:09:11.686
of other interesting types of tasks

00:09:11.686 --> 00:09:14.314
that we might want to
solve using deep learning.

00:09:14.314 --> 00:09:18.609
So today we're going to talk about several of these
different tasks and step through each of these

00:09:18.609 --> 00:09:21.515
and see how they all
work with deep learning.

00:09:21.515 --> 00:09:26.944
So we'll talk in more detail
about what each problem is as we get to it

00:09:26.944 --> 00:09:28.852
but this is kind of a summary slide

00:09:28.852 --> 00:09:31.480
that we'll talk first about
semantic segmentation.

00:09:31.480 --> 00:09:35.153
We'll talk about classification and localization,
then we'll talk about object detection,

00:09:35.153 --> 00:09:39.086
and finally a couple brief words
about instance segmentation.

00:09:39.967 --> 00:09:44.035
So first is the problem
of semantic segmentation.

00:09:44.035 --> 00:09:49.847
In the problem of semantic segmentation, we want
to input an image and then output a decision

00:09:49.847 --> 00:09:52.567
of a category for every
pixel in that image

00:09:52.567 --> 00:09:58.327
So this input image, for example,
is this cat walking through a field, he's very cute.

00:09:58.327 --> 00:10:04.517
And in the output we want to say for every pixel
is that pixel a cat or grass or sky or trees

00:10:04.517 --> 00:10:07.701
or background or some
other set of categories.

00:10:07.701 --> 00:10:11.922
So we're going to have some set of categories
just like we did in the image classification case

00:10:11.922 --> 00:10:15.820
but now rather than assigning a single category
label to the entire image, we want to produce

00:10:15.820 --> 00:10:19.569
a category label for each
pixel of the input image.

00:10:19.569 --> 00:10:22.674
And this is called semantic segmentation.

00:10:22.674 --> 00:10:27.340
So one interesting thing about semantic segmentation
is that it does not differentiate instances

00:10:27.340 --> 00:10:31.523
so in this example on the right we have this image
with two cows where they're standing right next

00:10:31.523 --> 00:10:36.859
to each other and when we're talking about semantic
segmentation we're just labeling all the pixels

00:10:36.859 --> 00:10:39.741
independently for what is
the category of that pixel.

00:10:39.741 --> 00:10:44.510
So in the case like this where we have two cows
right next to each other the output does not make

00:10:44.510 --> 00:10:46.840
any distinction, it does not distinguish

00:10:46.840 --> 00:10:48.309
between these two cows.

00:10:48.309 --> 00:10:51.782
Instead we just get a whole mass of pixels
that are all labeled as cow.

00:10:51.782 --> 00:10:56.625
So this is a bit of a shortcoming of semantic
segmentation and we'll see how we can fix this later

00:10:56.625 --> 00:10:58.910
when we move to instance segmentation.

00:10:58.910 --> 00:11:02.882
But at least for now we'll just talk about
semantic segmentation first.

00:11:04.437 --> 00:11:09.340
So one potential approach for attacking

00:11:09.340 --> 00:11:12.544
semantic segmentation might
be through classification.

00:11:12.544 --> 00:11:17.755
So you could use this idea of a
sliding window approach to semantic segmentation.

00:11:17.755 --> 00:11:24.315
So you might imagine that we take our input image and
we break it up into many many small, tiny local crops

00:11:24.315 --> 00:11:27.763
of the image so in this
example we've taken

00:11:27.763 --> 00:11:31.310
maybe three crops from
around the head of this cow

00:11:31.310 --> 00:11:36.564
and then you could imagine taking each of those crops
and now treating this as a classification problem.

00:11:36.564 --> 00:11:41.246
Saying for this crop, what is the category
of the central pixel of the crop?

00:11:41.246 --> 00:11:46.752
And then we could use all the same machinery that
we've developed for classifying entire images

00:11:46.752 --> 00:11:48.760
but now just apply it on crops rather than

00:11:48.760 --> 00:11:51.083
on the entire image.

00:11:51.083 --> 00:11:56.601
And this would probably work to some extent
but it's probably not a very good idea.

00:11:56.601 --> 00:12:02.498
So this would end up being super super
computationally expensive because we want to label

00:12:02.498 --> 00:12:07.319
every pixel in the image, we would need a separate
crop for every pixel in that image and this would be

00:12:07.319 --> 00:12:09.407
super super expensive to
run forward and backward

00:12:09.407 --> 00:12:10.910
passes through.

00:12:10.910 --> 00:12:17.085
And moreover, if you think about it,
we can actually share computation between different

00:12:17.085 --> 00:12:20.476
patches so if you're trying
to classify two patches

00:12:20.476 --> 00:12:22.950
that are right next to each
other and actually overlap

00:12:22.950 --> 00:12:25.509
then the convolutional
features of those patches

00:12:25.509 --> 00:12:30.611
will end up going through the same convolutional layers
and we can actually share a lot of the computation

00:12:30.611 --> 00:12:32.644
when we apply this

00:12:32.644 --> 00:12:34.742
kind of approach

00:12:34.742 --> 00:12:37.194
to separate patches in the image.

00:12:37.194 --> 00:12:41.896
So this is actually a terrible idea and nobody
does this and you should probably not do this

00:12:41.896 --> 00:12:48.683
but it's at least the first thing you might think of if
you were trying to think about semantic segmentation.
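Just to make the cost of the naive sliding-window idea concrete, here's a minimal NumPy sketch (a toy image and a hypothetical crop size k=5; the per-crop classifier itself is omitted). The point is that labeling every pixel requires one crop, and hence one forward pass, per pixel:

```python
import numpy as np

def sliding_window_crops(img, k=5):
    """Naive approach from the lecture: one k-by-k crop centered on every
    pixel, each to be classified independently (illustration only)."""
    H, W = img.shape[:2]
    pad = k // 2
    # pad the borders so every pixel has a full k-by-k neighborhood
    padded = np.pad(img, ((pad, pad), (pad, pad), (0, 0)), mode='edge')
    crops = np.empty((H * W, k, k, img.shape[2]), dtype=img.dtype)
    i = 0
    for y in range(H):
        for x in range(W):
            crops[i] = padded[y:y + k, x:x + k]
            i += 1
    return crops

img = np.zeros((32, 32, 3), dtype=np.float32)
crops = sliding_window_crops(img, k=5)
print(crops.shape)  # (1024, 5, 5, 3): a separate forward pass per pixel
```

Even for this tiny 32x32 image you get 1024 crops, and neighboring crops overlap almost completely, which is exactly the redundant computation the fully convolutional approach avoids.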

00:12:48.683 --> 00:12:53.372
Then the next idea that works a bit better is
this idea of a fully convolutional network.

00:12:53.372 --> 00:12:58.305
So rather than extracting individual patches from the
image and classifying these patches independently,

00:12:58.305 --> 00:13:03.604
we can imagine just having our network be a whole giant
stack of convolutional layers with no fully connected

00:13:03.604 --> 00:13:06.501
layers or anything so in this
case we just have a bunch

00:13:06.501 --> 00:13:12.633
of convolutional layers that are all maybe three
by three with zero padding or something like that

00:13:12.633 --> 00:13:15.422
so that each convolutional
layer preserves the spatial size

00:13:15.422 --> 00:13:17.843
of the input and now if we pass our image

00:13:17.843 --> 00:13:20.605
through a whole stack of
these convolutional layers,

00:13:20.605 --> 00:13:27.184
then the final convolutional layer could just
output a tensor of scores that is C by H by W

00:13:27.184 --> 00:13:29.622
where C is the number of
categories that we care about

00:13:29.622 --> 00:13:34.734
and you could see this tensor as just giving
our classification scores for every pixel

00:13:34.734 --> 00:13:38.127
at every location
in the input image.

00:13:38.127 --> 00:13:43.014
And we could compute this all at once with
just some giant stack of convolutional layers.

00:13:43.014 --> 00:13:47.216
And then you could imagine training this thing
by putting a classification loss at every pixel

00:13:47.216 --> 00:13:50.558
of this output, taking an
average over those pixels

00:13:50.558 --> 00:13:55.137
in space, and just training this kind of network
through normal, regular back propagation.

00:13:55.137 --> 00:13:55.970
Question?

00:13:58.430 --> 00:14:01.179
Oh, the question is how do you develop
training data for this?

00:14:01.179 --> 00:14:04.366
It's very expensive right.
So the training data for this would be

00:14:04.366 --> 00:14:06.899
we need to label every
pixel in those input images

00:14:06.899 --> 00:14:11.831
so there's tools that people sometimes have online
where you can go in and sort of draw contours

00:14:11.831 --> 00:14:14.613
around the objects and
then fill in regions

00:14:14.613 --> 00:14:17.604
but in general getting this kind of
training data is very expensive.

00:14:29.243 --> 00:14:31.357
Yeah, the question is
what is the loss function?

00:14:31.357 --> 00:14:37.009
So here since we're making a classification
decision per pixel then we put a cross entropy loss

00:14:37.009 --> 00:14:39.025
on every pixel of the output.

00:14:39.025 --> 00:14:42.212
So we have the ground truth category label
for every pixel in the output,

00:14:42.212 --> 00:14:45.793
then we compute a cross-entropy loss
between every pixel in the output

00:14:45.793 --> 00:14:48.143
and the ground truth pixels and then

00:14:48.143 --> 00:14:52.739
take either a sum or an average over space
and then sum or average over the mini-batch.
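The per-pixel loss just described can be sketched in NumPy like this (a toy implementation, not any framework's exact API): a softmax over the class dimension at every pixel, then the mean negative log-probability of the ground-truth label.

```python
import numpy as np

def pixelwise_cross_entropy(scores, labels):
    """scores: (N, C, H, W) raw class scores; labels: (N, H, W) integer
    ground-truth classes. Cross-entropy at every pixel, averaged over
    all pixels and over the mini-batch."""
    N, C, H, W = scores.shape
    shifted = scores - scores.max(axis=1, keepdims=True)       # numeric stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    n, h, w = np.meshgrid(np.arange(N), np.arange(H), np.arange(W),
                          indexing='ij')
    # pick out the log-probability of the correct class at each pixel
    return -log_probs[n, labels, h, w].mean()

rng = np.random.default_rng(0)
scores = rng.standard_normal((2, 5, 8, 8))
labels = rng.integers(0, 5, size=(2, 8, 8))
loss = pixelwise_cross_entropy(scores, labels)
print(float(loss))
```

A quick sanity check: with all-zero scores and C=5 classes, every class gets equal probability at every pixel, so the loss is exactly log 5 (about 1.609).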

00:14:52.739 --> 00:14:53.572
Question?

00:15:18.548 --> 00:15:26.505
Yeah, the question is do we assume

00:15:26.505 --> 00:15:28.008
that we know the categories?

00:15:28.008 --> 00:15:31.258
So yes, we do assume that we
know the categories up front

00:15:31.258 --> 00:15:33.716
so this is just like the
image classification case.

00:15:33.716 --> 00:15:39.466
So in image classification we know at the start of
training based on our data set that maybe there's 10 or 20

00:15:39.466 --> 00:15:41.357
or 100 or 1000 classes that we care about

00:15:41.357 --> 00:15:50.077
for this data set, and here we are similarly fixed
to the set of classes for that data set.

00:15:51.012 --> 00:15:56.206
So this model is relatively simple and you
can imagine this working reasonably well

00:15:56.206 --> 00:15:58.853
assuming that you tuned all
the hyperparameters right

00:15:58.853 --> 00:16:00.562
but there's kind of a problem, right.

00:16:00.562 --> 00:16:05.120
So in this setup, since we're applying a bunch
of convolutions that are all keeping the same

00:16:05.120 --> 00:16:07.479
spatial size of the input image,

00:16:07.479 --> 00:16:09.574
this would be super super expensive right.

00:16:09.574 --> 00:16:16.435
If you wanted to do convolutions that maybe have 64 or
128 or 256 channels for those convolutional filters

00:16:16.435 --> 00:16:18.982
which is pretty common in
a lot of these networks,

00:16:18.982 --> 00:16:24.111
then running those convolutions on this high resolution
input image over a sequence of layers would be

00:16:24.111 --> 00:16:25.849
extremely computationally expensive

00:16:25.849 --> 00:16:27.361
and would take a ton of memory.

00:16:27.361 --> 00:16:31.304
So in practice, you don't usually see
networks with this architecture.

00:16:31.304 --> 00:16:37.512
Instead you tend to see networks that look something
like this where we have some downsampling

00:16:37.512 --> 00:16:39.277
and then some upsampling
of the feature map

00:16:39.277 --> 00:16:40.592
inside the network.

00:16:40.592 --> 00:16:44.614
So rather than doing all the convolutions of
the full spatial resolution of the image,

00:16:44.614 --> 00:16:48.997
we'll maybe go through a small number of
convolutional layers at the original resolution

00:16:48.997 --> 00:16:53.991
then downsample that feature map using something
like max pooling or strided convolutions

00:16:53.991 --> 00:16:55.719
and sort of downsample, downsample,

00:16:55.719 --> 00:16:59.338
so we have convolutions and downsampling,
convolutions and downsampling,

00:16:59.338 --> 00:17:04.640
that look much like a lot of the classification
networks that you see but now the difference is that

00:17:04.640 --> 00:17:09.346
rather than transitioning to a fully connected layer
like you might do in an image classification setup,

00:17:09.346 --> 00:17:12.071
instead we want to increase
the spatial resolution

00:17:12.071 --> 00:17:15.213
of our predictions in the
second half of the network

00:17:15.214 --> 00:17:20.614
so that our output image can now be the same
size as our input image and this ends up being

00:17:20.614 --> 00:17:22.136
much more computationally efficient

00:17:22.136 --> 00:17:26.417
because you can make the network very deep
and work at a lower spatial resolution

00:17:26.417 --> 00:17:29.749
for many of the layers at
the inside of the network.

00:17:29.749 --> 00:17:36.418
So we've already seen examples of downsampling
when it comes to convolutional networks.

00:17:36.418 --> 00:17:41.180
We've seen that you can do strided convolutions or
various types of pooling to reduce the spatial size

00:17:41.180 --> 00:17:44.050
of the image inside a
network but we haven't

00:17:44.050 --> 00:17:46.040
really talked about
upsampling and the question

00:17:46.040 --> 00:17:51.476
you might be wondering is what do these upsampling
layers actually look like inside the network?

00:17:51.476 --> 00:17:55.875
And what are our strategies for increasing the
size of a feature map inside the network?

00:17:55.875 --> 00:17:59.208
Sorry, was there a question in the back?

00:18:07.316 --> 00:18:09.061
Yeah, so the question
is how do we upsample?

00:18:09.061 --> 00:18:11.758
And the answer is that's the topic
of the next couple slides.

00:18:11.758 --> 00:18:13.263
[laughing]

00:18:13.263 --> 00:18:21.075
So one strategy for upsampling is something like
unpooling. So we have this notion of pooling

00:18:21.075 --> 00:18:23.379
to downsample so we talked
about average pooling

00:18:23.379 --> 00:18:26.187
or max pooling so when we
talked about average pooling

00:18:26.187 --> 00:18:30.389
we're kind of taking a spatial average within
a receptive field of each pooling region.

00:18:30.389 --> 00:18:34.853
One kind of analog for upsampling is
this idea of nearest neighbor unpooling.

00:18:34.853 --> 00:18:39.090
So here on the left we see this example of
nearest neighbor unpooling where our input

00:18:39.090 --> 00:18:41.379
is maybe some two by
two grid and our output

00:18:41.379 --> 00:18:43.853
is a four by four grid
and now in our output

00:18:43.853 --> 00:18:50.461
we've done a two by two stride two nearest neighbor
unpooling or upsampling where we've just duplicated

00:18:50.461 --> 00:18:53.177
that element for every
point in our two by two

00:18:53.177 --> 00:18:56.149
receptive field of the unpooling region.

00:18:56.149 --> 00:19:03.472
Another thing you might see is this bed of nails unpooling
or bed of nails upsampling where you'll just take,

00:19:03.472 --> 00:19:09.116
again we have a two by two receptive field for
our unpooling regions and then you'll take the,

00:19:09.116 --> 00:19:23.462
in this case you make it all zeros except for one element of the unpooling region so in this case we've taken all of
our inputs and always put them in the upper left hand corner of this unpooling region and everything else is zeros.

00:19:23.463 --> 00:19:24.867
And this is kind of like a bed of nails

00:19:24.867 --> 00:19:33.559
because the zeros are very flat, then you've got these things
poking up for the values at these various non-zero regions.
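Both of these fixed upsampling schemes are easy to sketch in NumPy (a toy 2x2 input with stride-2 unpooling, matching the picture on the slide):

```python
import numpy as np

def nearest_neighbor_unpool(x, stride=2):
    """Duplicate each input element across its stride-by-stride output region."""
    return np.repeat(np.repeat(x, stride, axis=0), stride, axis=1)

def bed_of_nails_unpool(x, stride=2):
    """Place each input in the top-left corner of its region; zeros elsewhere."""
    H, W = x.shape
    out = np.zeros((H * stride, W * stride), dtype=x.dtype)
    out[::stride, ::stride] = x
    return out

x = np.array([[1, 2],
              [3, 4]])
print(nearest_neighbor_unpool(x))
# [[1 1 2 2]
#  [1 1 2 2]
#  [3 3 4 4]
#  [3 3 4 4]]
print(bed_of_nails_unpool(x))
# [[1 0 2 0]
#  [0 0 0 0]
#  [3 0 4 0]
#  [0 0 0 0]]
```

Neither scheme has any learnable parameters; they just pick a fixed rule for spreading each low-resolution value over its 2x2 output region.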

00:19:33.560 --> 00:19:39.591
Another thing that you see sometimes which was alluded to
by the question a minute ago is this idea of max unpooling

00:19:39.591 --> 00:19:52.046
so in a lot of these networks they tend to be symmetrical where we have a downsampling portion of the network
and then an upsampling portion of the network with a symmetry between those two portions of the network.

00:19:52.047 --> 00:20:06.139
So sometimes what you'll see is this idea of max unpooling where each upsampling layer
is associated with one of the pooling layers in the first half of the network, and now in the first half,

00:20:06.140 --> 00:20:16.464
in the downsampling when we do max pooling we'll actually remember which element
of the receptive field during max pooling was used to do the max pooling

00:20:16.465 --> 00:20:26.390
and now when we go through the rest of the network then we'll do something that looks like this bed of nails
upsampling except rather than always putting the elements in the same position, instead we'll stick it

00:20:26.391 --> 00:20:33.697
into the position that was used in the corresponding
max pooling step earlier in the network.

00:20:33.697 --> 00:20:38.321
I'm not sure if that explanation was clear
but hopefully the picture makes sense.

00:20:39.248 --> 00:20:42.388
Yeah, so then you just end up
filling the rest with zeros.

00:20:42.388 --> 00:20:48.256
So then you fill the rest with zeros and then you stick the elements
from the low resolution patch up into the high resolution patch

00:20:48.256 --> 00:20:54.964
at the points where the max pooling took place
at the corresponding max pooling there.
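A small NumPy sketch of the idea in the picture (a toy 4x4 input and 2x2 stride-2 pooling; these are hypothetical helper functions, not any framework's API): the pooling step remembers the argmax position within each region, and unpooling puts each value back into that remembered slot.

```python
import numpy as np

def max_pool_with_indices(x, stride=2):
    """2x2 stride-2 max pool that also remembers the argmax positions."""
    H, W = x.shape
    pooled = np.zeros((H // stride, W // stride), dtype=x.dtype)
    indices = np.zeros((H // stride, W // stride), dtype=int)
    for i in range(H // stride):
        for j in range(W // stride):
            patch = x[i*stride:(i+1)*stride, j*stride:(j+1)*stride]
            indices[i, j] = patch.argmax()   # flat index within the patch
            pooled[i, j] = patch.max()
    return pooled, indices

def max_unpool(x, indices, stride=2):
    """Put each value back at the slot its max came from; zeros elsewhere."""
    H, W = x.shape
    out = np.zeros((H * stride, W * stride), dtype=x.dtype)
    for i in range(H):
        for j in range(W):
            di, dj = divmod(indices[i, j], stride)
            out[i*stride + di, j*stride + dj] = x[i, j]
    return out

x = np.array([[1, 2, 6, 3],
              [3, 5, 2, 1],
              [1, 2, 2, 1],
              [7, 3, 4, 8]])
pooled, idx = max_pool_with_indices(x)
print(pooled)   # [[5 6]
                #  [7 8]]
print(max_unpool(pooled, idx))
```

Each max lands back in its original position within its 2x2 region, which is exactly the spatial information that plain bed-of-nails upsampling throws away.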

00:20:56.871 --> 00:21:00.723
Okay, so that's kind
of an interesting idea.

00:21:00.723 --> 00:21:02.056
Sorry, question?

00:21:08.696 --> 00:21:11.801
Oh yeah, so the question is why is this
a good idea? Why might this matter?

00:21:11.801 --> 00:21:16.806
So the idea is that when we're doing semantic segmentation
we want our predictions to be pixel perfect right.

00:21:16.806 --> 00:21:23.708
We kind of want to get those sharp boundaries and
those tiny details in our predictive segmentation

00:21:23.708 --> 00:21:31.782
so now if you're doing this max pooling, there's this sort of heterogeneity
that's happening inside the feature map due to the max pooling

00:21:31.782 --> 00:21:44.363
where you're sort of losing spatial information in some sense,
because you don't know where that feature vector came from in the local receptive field after max pooling.

00:21:45.253 --> 00:21:53.759
So if you actually unpool by putting the vector in the same slot you might
think that that might help us handle these fine details a little bit better

00:21:53.759 --> 00:21:59.051
and help us preserve some of that spatial
information that was lost during max pooling.

00:21:59.051 --> 00:21:59.884
Question?

00:22:10.883 --> 00:22:13.809
The question is does this make
things easier for back prop?

00:22:13.809 --> 00:22:21.009
Yeah, I guess, I don't think it changes the back prop dynamics too much
because storing these indices is not a huge computational overhead.

00:22:21.009 --> 00:22:24.851
They're pretty small in
comparison to everything else.

00:22:24.851 --> 00:22:29.566
So another thing that you'll see sometimes
is this idea of transpose convolution.

00:22:29.566 --> 00:22:34.724
So transpose convolution, so for these various
types of unpooling that we just talked about,

00:22:34.724 --> 00:22:38.945
these bed of nails, this nearest neighbor,
this max unpooling, all of these are kind of

00:22:38.945 --> 00:22:44.964
a fixed function, they're not really learning exactly how
to do the upsampling so if you think about something

00:22:44.964 --> 00:22:47.404
like strided convolution,
strided convolution

00:22:47.404 --> 00:22:54.423
is kind of like a learnable layer that learns the way that
the network wants to perform downsampling at that layer.

00:22:54.423 --> 00:23:02.534
And by analogy with that there's this type of layer called a
transpose convolution that lets us do kind of learnable upsampling.

00:23:02.534 --> 00:23:08.068
So it will both upsample the feature map and learn
some weights about how it wants to do that upsampling.

00:23:08.068 --> 00:23:13.262
And this is really just another type of convolution
so to see how this works remember how a normal

00:23:13.262 --> 00:23:16.663
three by three stride one pad
one convolution would work.

00:23:16.663 --> 00:23:20.488
That for this kind of normal convolution that
we've seen many times now in this class,

00:23:20.488 --> 00:23:24.316
our input might by four by four,
our output might be four by four,

00:23:24.316 --> 00:23:29.721
and now we'll have this three by three kernel and we'll take an inner
product between, we'll plop down that kernel at the corner of the image,

00:23:29.721 --> 00:23:35.409
take an inner product, and that inner product will give us the value
of the activation in the upper left hand corner of our output.

00:23:35.409 --> 00:23:39.388
And we'll repeat this for every
receptive field in the image.

00:23:39.388 --> 00:23:44.688
Now if we talk about strided convolution then
strided convolution ends up looking pretty similar.

00:23:44.688 --> 00:23:49.648
However, our input is maybe a four by four
region and our output is a two by two region.

00:23:49.648 --> 00:24:00.808
But we still have this idea of there being some three by three filter or kernel that we plop down in
the corner of the image, take an inner product and use that to compute a value of the activation in the output.

00:24:00.808 --> 00:24:08.879
But now with strided convolution the idea is that
rather than plopping down that filter at every possible point in the input,

00:24:08.879 --> 00:24:16.961
instead we're going to move the filter by two pixels
in the input every time we move by one pixel in the output.

00:24:16.961 --> 00:24:23.361
Right so this stride of two gives us a ratio between how much do
we move in the input versus how much do we move in the output.

00:24:23.361 --> 00:24:32.495
So when you do a strided convolution with stride two this ends up downsampling
the image or the feature map by a factor of two in kind of a learnable way.

00:24:32.495 --> 00:24:42.638
And now a transpose convolution is sort of the opposite in a way so here our
input will be a two by two region and our output will be a four by four region.

00:24:42.638 --> 00:24:46.904
But now the operation that we perform with
transpose convolution is a little bit different.

00:24:46.904 --> 00:24:56.074
Now so rather than taking an inner product instead what we're going
to do is we're going to take the value of our input feature map

00:24:56.074 --> 00:25:00.856
at that upper left hand corner and that'll be
some scalar value in the upper left hand corner.

00:25:00.856 --> 00:25:06.767
We're going to multiply the filter by that scalar value
and then copy those values over to this three by three

00:25:06.767 --> 00:25:14.428
region in the output so rather than taking an inner
product with our filter and the input, instead our input

00:25:14.428 --> 00:25:24.911
gives weights that we will use to weight the filter and then our output will
be weighted copies of the filter that are weighted by the values in the input.

00:25:24.911 --> 00:25:36.703
And now we can do this sort of same ratio trick in order to upsample so now when we move one pixel
in the input now we can plop our filter down two pixels away in the output and it's the same trick

00:25:36.703 --> 00:25:43.713
that now the blue pixel in the input is some scalar value and we'll
take that scalar value, multiply it by the values in the filter,

00:25:43.713 --> 00:25:49.048
and copy those weighted filter values
into this new region in the output.

00:25:49.048 --> 00:25:54.765
The tricky part is that sometimes these receptive
fields in the output can overlap now and now when these

00:25:54.765 --> 00:26:00.143
receptive fields in the output overlap
we just sum the results in the output.

00:26:00.143 --> 00:26:07.931
So then you can imagine repeating this everywhere and repeating this
process everywhere and this ends up doing sort of a learnable upsampling

00:26:07.931 --> 00:26:14.466
where we use these learned convolutional filter weights
to upsample the image and increase the spatial size.

00:26:15.609 --> 00:26:19.975
By the way, you'll see this operation go
by a lot of different names in literature.

00:26:19.975 --> 00:26:24.153
Sometimes this gets called
things like deconvolution

00:26:24.153 --> 00:26:27.024
which I think is kind of a
bad name but you'll see it

00:26:27.024 --> 00:26:34.066
out there in papers so from a signal processing perspective
deconvolution means the inverse operation to convolution

00:26:34.066 --> 00:26:39.945
which this is not however you'll frequently see
this type of layer called a deconvolution layer

00:26:39.945 --> 00:26:44.121
in some deep learning papers so be aware
of that, watch out for that terminology.

00:26:44.121 --> 00:26:48.280
You'll also sometimes see this called
upconvolution which is kind of a cute name.

00:26:48.280 --> 00:26:51.490
Sometimes it gets called
fractionally strided convolution

00:26:51.490 --> 00:27:01.437
because if we think of the stride as the ratio in step between the input and the output
then now this is something like a stride one half convolution because of this ratio

00:27:01.437 --> 00:27:04.869
of one to two between steps in the input
and steps in the output.

00:27:04.869 --> 00:27:09.311
This also sometimes gets called a backwards
strided convolution because if you think about it

00:27:09.311 --> 00:27:15.287
and work through the math this ends up being the
same, the forward pass of a transpose convolution

00:27:15.287 --> 00:27:20.030
ends up being the same mathematical operation
as the backwards pass in a normal convolution

00:27:20.030 --> 00:27:28.698
so you might have to take my word for it, that might not be super obvious when you first
look at this but that's kind of a neat fact so you'll sometimes see that name as well.

00:27:28.698 --> 00:27:36.923
And as maybe a bit of a more concrete example of what this looks like I
think it's maybe a little easier to see in one dimension so if we imagine,

00:27:36.923 --> 00:27:41.272
so here we're doing a three by three
transpose convolution in one dimension.

00:27:41.272 --> 00:27:46.091
Sorry, not three by three, a three by one
transpose convolution in one dimension.

00:27:46.091 --> 00:27:50.211
So our filter here is just three numbers. Our
input is two numbers and now you can see

00:27:50.211 --> 00:27:58.060
that in our output we've taken the values in the input, used them to weight the
values of the filter and plopped down those weighted filters in the output

00:27:58.060 --> 00:28:03.597
with a stride of two and now where these receptive
fields overlap in the output then we sum.
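Here is that one-dimensional, three-tap, stride-two case as a small NumPy sketch (an editor's illustration; the names are made up): each input scalar weights a copy of the filter, the weighted copies land two steps apart in the output, and overlapping positions sum:

```python
import numpy as np

def transpose_conv1d(x, w, stride=2):
    """1D transpose convolution: each input scalar weights a copy of
    the filter w; copies are placed `stride` apart and overlaps sum."""
    out = np.zeros(stride * (len(x) - 1) + len(w))
    for i, xi in enumerate(x):
        out[i * stride : i * stride + len(w)] += xi * w  # weighted copy of filter
    return out
```

With a two-element input and a three-element filter, only the middle output position receives contributions from both copies, which is exactly the overlap-then-sum behavior described above.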

00:28:03.597 --> 00:28:12.253
So you might be wondering, this is kind of a funny name. Where does the name transpose
convolution come from and why is that actually my preferred name for this operation?

00:28:12.253 --> 00:28:15.530
So that comes from this kind of
neat interpretation of convolution.

00:28:15.530 --> 00:28:21.902
So it turns out that any time you do convolution you can
always write convolution as a matrix multiplication.

00:28:21.902 --> 00:28:25.737
So again, this is kind of easier to see
with a one-dimensional example

00:28:25.737 --> 00:28:33.470
but here we've got some weights. So we're doing a one-dimensional
convolution with a weight vector x which has three elements,

00:28:34.497 --> 00:28:38.706
and an input vector a, which
has four elements, A, B, C, D.

00:28:38.706 --> 00:28:47.869
So here we're doing a three by one convolution with stride one and you can
see that we can frame this whole operation as a matrix multiplication

00:28:47.869 --> 00:28:54.781
where we take our convolutional kernel x
and turn it into some matrix capital X

00:28:54.781 --> 00:28:59.360
which contains copies of that convolutional
kernel that are offset by different regions.

00:28:59.360 --> 00:29:08.157
And now we can take this giant weight matrix X and do a matrix vector multiplication
between X and our input a, and this just produces the same result as convolution.

00:29:09.274 --> 00:29:17.770
And now transpose convolution means that we're going to take this same weight
matrix, but now we're going to multiply by the transpose of that same weight matrix.

00:29:17.770 --> 00:29:26.491
So here you can see the same example for this stride one convolution on the
left and the corresponding stride one transpose convolution on the right.

00:29:26.491 --> 00:29:31.018
And if you work through the details you'll
see that when it comes to stride one,

00:29:31.018 --> 00:29:37.570
a stride one transpose convolution also ends up being a
stride one normal convolution so there's a little bit

00:29:37.570 --> 00:29:42.334
of details in the way that the border and the padding
are handled but it's fundamentally the same operation.
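You can check this matrix view numerically. The sketch below is an editor's illustration, with one plausible border handling (zero padding of one, for a same-size output) that may differ from the slide; it builds the matrix X for a three-tap, stride-one 1D convolution and then applies X.T for the transpose convolution:

```python
import numpy as np

def conv_matrix(w, n):
    """Matrix X such that X @ a equals 1D convolution of a (length n)
    with the 3-tap filter w, stride 1, zero padding 1 (same-size output)."""
    X = np.zeros((n, n))
    for i in range(n):
        for k in range(3):
            j = i + k - 1          # input index under filter tap k
            if 0 <= j < n:
                X[i, j] = w[k]     # copies of w, offset row by row
    return X

w = np.array([1., 2., 3.])
a = np.array([1., 2., 3., 4.])     # the input vector (a, b, c, d)
X = conv_matrix(w, 4)
conv_out = X @ a                   # ordinary stride-1 convolution
tconv_out = X.T @ a                # the corresponding transpose convolution
```

If you print X.T, each row contains the filter reversed, which is one way to see why a stride-one transpose convolution is itself just a stride-one convolution with a flipped kernel; with stride two the transposed matrix loses that convolutional structure.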

00:29:42.334 --> 00:29:45.879
But now things look different
when you talk about a stride of two.

00:29:45.879 --> 00:29:54.240
So again, here on the left we can take a stride two convolution and
write out this stride two convolution as a matrix multiplication.

00:29:54.240 --> 00:29:59.837
And now the corresponding transpose convolution
is no longer a convolution so if you look

00:29:59.837 --> 00:30:04.985
through this weight matrix and think about how
convolutions end up getting represented in this way

00:30:04.985 --> 00:30:13.913
then now this transposed matrix for the stride two convolution is something
fundamentally different from the original normal convolution operation

00:30:13.913 --> 00:30:20.647
so that's kind of the reasoning behind the name and that's why I
think that's kind of the nicest name to call this operation by.

00:30:20.647 --> 00:30:22.980
Sorry, was there a question?

00:30:27.991 --> 00:30:29.646
Sorry?

00:30:29.646 --> 00:30:36.523
It's very possible there's a typo in the slide so please point
out on Piazza and I'll fix it but I hope the idea was clear.

00:30:36.523 --> 00:30:43.000
Is there another question? Okay, thank you
[laughing]. Yeah, so, oh no lots of questions.

00:30:53.576 --> 00:30:56.360
Yeah, so the issue is why
do we sum and not average?

00:30:56.360 --> 00:31:03.404
So the reason we sum is due to this transpose convolution
formulation, so that's the reason why we sum

00:31:03.404 --> 00:31:11.325
but you're right, this is kind of a problem: the magnitudes will actually
vary in the output depending on how many receptive fields overlap in the output.

00:31:11.325 --> 00:31:15.322
So actually in practice this is something that people
started to point out very recently and somewhat

00:31:15.322 --> 00:31:26.250
switched away from this setup, so using three by three stride two transpose convolution
for upsampling can sometimes produce checkerboard artifacts in the output exactly due to that problem.

00:31:26.250 --> 00:31:37.127
So what I've seen in a couple more recent papers is maybe to use four by four stride two or two by two
stride two transpose convolution for upsampling and that helps alleviate that problem a little bit.

00:31:46.834 --> 00:31:52.515
- Yeah, so the question is what is a stride half convolution
and where does that terminology come from?

00:31:52.515 --> 00:31:56.790
I think that was from my paper. So that was
actually, yes that was definitely this.

00:31:56.790 --> 00:32:01.181
So at the time I was writing that paper I was kind
of into the name fractionally strided convolution

00:32:01.181 --> 00:32:07.282
but after thinking about it a bit more I think
transpose convolution is probably the right name.

00:32:07.282 --> 00:32:13.746
So then this idea of semantic segmentation
actually ends up being pretty natural.

00:32:13.746 --> 00:32:19.540
You just have this giant convolutional network with
downsampling and upsampling inside the network

00:32:19.540 --> 00:32:22.053
and now our downsampling will
be by strided convolution

00:32:22.053 --> 00:32:28.035
or pooling, our upsampling will be by transpose
convolution or various types of unpooling or upsampling

00:32:28.035 --> 00:32:33.634
and we can train this whole thing end to end with back
propagation using this cross entropy loss over every pixel.
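As a sketch of that per-pixel loss (an editor's illustration; the shapes are an assumption, with class scores of shape (C, H, W) and integer labels of shape (H, W)):

```python
import numpy as np

def segmentation_loss(scores, labels):
    """Per-pixel cross-entropy, averaged over all pixels.
    scores: (C, H, W) class scores per pixel; labels: (H, W) ints."""
    C, H, W = scores.shape
    s = scores - scores.max(axis=0, keepdims=True)   # numerical stability
    log_probs = s - np.log(np.exp(s).sum(axis=0, keepdims=True))
    ii, jj = np.meshgrid(np.arange(H), np.arange(W), indexing='ij')
    # pick out the log-probability of the correct class at every pixel
    return -log_probs[labels, ii, jj].mean()
```

This is the same softmax cross-entropy from image classification, just applied independently at every spatial position and averaged, so the whole downsample-upsample network trains with ordinary back propagation.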

00:32:33.634 --> 00:32:41.514
So this is actually pretty cool that we can take a lot of the same machinery
that we already learned for image classification and now just apply it

00:32:41.514 --> 00:32:45.414
very easily to extend to new types
of problems so that's super cool.

00:32:46.333 --> 00:32:52.024
So the next task that I want to talk about is
this idea of classification plus localization.

00:32:52.024 --> 00:32:54.953
So we've talked about
image classification a lot

00:32:54.953 --> 00:33:01.234
where we want to just assign a category label to the input image but
sometimes you might want to know a little bit more about the image.

00:33:01.234 --> 00:33:09.077
In addition to predicting what the category is, in this case the
cat, you might also want to know where is that object in the image?

00:33:09.077 --> 00:33:17.874
So in addition to predicting the category label cat, you might also
want to draw a bounding box around the region of the cat in that image.

00:33:17.874 --> 00:33:22.713
And classification plus localization, the
distinction here between this and object detection

00:33:22.713 --> 00:33:31.242
is that in the localization scenario you assume ahead of time that you know there's
exactly one object in the image that you're looking for or maybe more than one

00:33:31.242 --> 00:33:41.001
but you know ahead of time that we're going to make some classification decision about this image and
we're going to produce exactly one bounding box that's going to tell us where that object is located

00:33:41.001 --> 00:33:47.584
in the image so we sometimes call that
task classification plus localization.

00:33:47.584 --> 00:33:53.680
And again, we can reuse a lot of the same machinery that we've already
learned from image classification in order to tackle this problem.

00:33:53.680 --> 00:33:58.220
So kind of a basic architecture for
this problem looks something like this.

00:33:58.220 --> 00:34:09.301
So again, we have our input image, we feed our input image through some giant convolutional
network, AlexNet for example, which will give us some final vector summarizing the content of the image.

00:34:09.301 --> 00:34:15.730
Then just like before we'll have some fully connected layer
that goes from that final vector to our class scores.

00:34:15.730 --> 00:34:21.109
But now we'll also have another fully connected
layer that goes from that vector to four numbers.

00:34:21.109 --> 00:34:28.478
Where the four numbers are something like the height, the
width, and the x and y positions of that bounding box.

00:34:28.478 --> 00:34:34.228
And now our network will produce these two different
outputs, one is this set of class scores,

00:34:34.228 --> 00:34:39.094
and the other are these four numbers giving the
coordinates of the bounding box in the input image.
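Schematically, the two heads are just two affine layers hanging off one shared feature vector; the sketch below is an editor's illustration with made-up dimensions:

```python
import numpy as np

rng = np.random.default_rng(0)
D, C = 4096, 20                       # feature size, number of classes (assumed)

feat = rng.standard_normal(D)         # stand-in for the conv net's final vector
W_cls = rng.standard_normal((C, D)) * 0.01   # classification head
W_box = rng.standard_normal((4, D)) * 0.01   # box head: (x, y, w, h)

class_scores = W_cls @ feat           # C numbers, one score per category
box = W_box @ feat                    # 4 numbers for the bounding box
```

Both heads see the same features, so one forward pass through the shared network produces both outputs.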

00:34:39.094 --> 00:34:44.489
And now during training time, when we train this network
we'll actually have two losses so in this scenario

00:34:44.489 --> 00:34:47.210
we're sort of assuming a
fully supervised setting

00:34:47.210 --> 00:34:55.330
so we assume that each of our training images is annotated with both a category
label and also a ground truth bounding box for that category in the image.

00:34:55.331 --> 00:34:57.118
So now we have two loss functions.

00:34:57.118 --> 00:35:03.360
We have our favorite softmax loss that we compute using the
ground truth category label and the predicted class scores,

00:35:03.360 --> 00:35:13.669
and we also have some kind of loss that gives us some measure of dissimilarity between our
predicted coordinates for the bounding box and our actual coordinates for the bounding box.

00:35:13.669 --> 00:35:20.509
So one very simple thing is to just take an L2 loss between those two and that's
kind of the simplest thing that you'll see in practice although sometimes

00:35:20.509 --> 00:35:27.728
people play around with this and maybe use L1 or smooth L1 or they parametrize
the bounding box a little bit differently but the idea is always the same,

00:35:27.728 --> 00:35:35.509
that you have some regression loss between your predicted bounding
box coordinates and the ground truth bounding box coordinates.

00:35:35.509 --> 00:35:39.510
Question?
Sorry, go ahead.

00:35:49.410 --> 00:35:52.193
So the question is, is this a good idea
to do all at the same time?

00:35:52.193 --> 00:35:55.600
Like what happens if you misclassify, should
you even look at the box coordinates?

00:35:55.600 --> 00:35:59.901
So sometimes people get fancy with it,
so in general it works okay.

00:35:59.901 --> 00:36:03.652
It's not a big problem, you can actually train a
network to do both of these things at the same time

00:36:03.652 --> 00:36:09.592
and it'll figure it out but sometimes things can get tricky
in terms of misclassification so sometimes what you'll see

00:36:09.592 --> 00:36:19.232
for example is that rather than predicting a single box you might make
a separate box prediction for each category and then only apply the loss

00:36:19.232 --> 00:36:24.091
to the predicted box corresponding
to the ground truth category.

00:36:24.091 --> 00:36:28.318
So people do get a little bit fancy with these
things that sometimes helps a bit in practice.

00:36:28.318 --> 00:36:34.611
But at least this basic setup, it might not be perfect or it
might not be optimal but it will work and it will do something.

00:36:34.611 --> 00:36:37.361
Was there a question in the back?

00:36:41.226 --> 00:36:46.746
Yeah, so that's the question is do these losses have
different units, do they dominate the gradient?

00:36:46.746 --> 00:36:49.306
So this is what we call a multi-task loss

00:36:49.306 --> 00:36:58.554
so whenever we're taking derivatives we always want to take derivative of a scalar
with respect to our network parameters and use that derivative to take gradient steps.

00:36:58.554 --> 00:37:01.331
But now we've got two scalars
that we want to both minimize

00:37:01.331 --> 00:37:11.833
so what you tend to do in practice is have some additional hyperparameter that gives you some weighting between
these two losses so you'll take a weighted sum of these two different loss functions to give our final scalar loss.

00:37:11.833 --> 00:37:15.642
And then you'll take your gradients with respect
to this weighted sum of the two losses.
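In code, the weighted sum might look like this sketch (an editor's illustration; lam is the weighting hyperparameter being discussed):

```python
import numpy as np

def multitask_loss(class_scores, label, box_pred, box_gt, lam=1.0):
    """Weighted sum of a softmax classification loss and an L2 box
    regression loss; lam is the (tricky-to-set) weighting hyperparameter."""
    s = class_scores - class_scores.max()            # stable softmax
    softmax_loss = -s[label] + np.log(np.exp(s).sum())
    l2_loss = np.sum((box_pred - box_gt) ** 2)
    return softmax_loss + lam * l2_loss              # single scalar to minimize
```

The gradient is taken with respect to this single scalar, which is why changing lam rescales the loss values themselves and makes cross-validating on the raw loss awkward.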

00:37:15.642 --> 00:37:23.691
And this ends up being really really tricky because this weighting parameter
is a hyperparameter that you need to set but it's kind of different

00:37:23.691 --> 00:37:27.851
from some of the other hyperparameters
that we've seen so far in the past right

00:37:27.851 --> 00:37:32.390
because this weighting hyperparameter actually
changes the value of the loss function

00:37:32.390 --> 00:37:43.091
so one thing that you might often look at when you're trying to set hyperparameters is you might make
different hyperparameter choices and see what happens to the loss under different choices of hyperparameters.

00:37:43.091 --> 00:37:51.089
But in this case, because the hyperparameter affects the absolute
value of the loss, making those comparisons becomes kind of tricky.

00:37:51.089 --> 00:37:54.473
So setting that hyperparameter
is somewhat difficult.

00:37:54.473 --> 00:38:00.393
And in practice, you kind of need to take it on a case by case basis for
exactly the problem you're solving but my general strategy for this

00:38:00.393 --> 00:38:08.163
is to have some other metric of performance that
you care about other than the actual loss value

00:38:08.163 --> 00:38:17.763
which then you actually use that final performance metric to make your cross
validation choices rather than looking at the value of the loss to make those choices.

00:38:17.763 --> 00:38:18.596
Question?

00:38:27.529 --> 00:38:32.682
So the question is why do we do this all
at once? Why not do this separately?

00:38:38.131 --> 00:38:45.413
Yeah, so the question is why don't we fix the big network and then
just only learn separate fully connected layers for these two tasks?

00:38:45.413 --> 00:38:52.702
People do do that sometimes and in fact that's probably the first thing
you should try if you're faced with a situation like this but in general

00:38:52.702 --> 00:39:00.574
whenever you're doing transfer learning you always get better performance if you fine tune
the whole system jointly because there's probably some mismatch between the features,

00:39:00.574 --> 00:39:09.280
if you train on ImageNet and then you use that network for your data set you're going
to get better performance on your data set if you can also change the network.

00:39:09.280 --> 00:39:16.870
But one trick you might see in practice sometimes is that you might freeze
that network then train those two things separately until convergence

00:39:16.870 --> 00:39:20.398
and then after they converge then you go
back and jointly fine tune the whole system.

00:39:20.398 --> 00:39:24.558
So that's a trick that sometimes people do
in practice in that situation.

00:39:24.558 --> 00:39:30.978
And as I've kind of alluded to this big network is often a
pre-trained network that is taken from ImageNet for example.

00:39:31.979 --> 00:39:37.339
So a bit of an aside, this idea of predicting
some fixed number of positions in the image

00:39:37.339 --> 00:39:41.881
can be applied to a lot of different problems
beyond just classification plus localization.

00:39:41.881 --> 00:39:44.710
One kind of cool example
is human pose estimation.

00:39:44.710 --> 00:39:49.440
So here we want to take an input image
which is a picture of a person.

00:39:49.440 --> 00:39:56.462
We want to output the positions of the joints for that person and this
actually allows the network to predict what is the pose of the human.

00:39:56.462 --> 00:39:59.030
Where are his arms, where are
his legs, stuff like that,

00:39:59.030 --> 00:40:04.060
and generally most people have the same number of
joints. That's a bit of a simplifying assumption,

00:40:04.060 --> 00:40:06.862
it might not always be true
but it works for the network.

00:40:06.862 --> 00:40:10.251
So for example one
parameterization that you might see

00:40:10.251 --> 00:40:13.451
in some data sets is
define a person's pose

00:40:13.451 --> 00:40:15.430
by 14 joint positions.

00:40:15.430 --> 00:40:16.932
Their feet and their knees and their hips

00:40:16.932 --> 00:40:19.652
and something like that and
now when we train the network

00:40:19.652 --> 00:40:23.150
then we're going to input
this image of a person

00:40:23.150 --> 00:40:27.132
and now we're going to output
14 numbers in this case

00:40:27.132 --> 00:40:30.521
giving the x and y coordinates
for each of those 14 joints.

00:40:30.521 --> 00:40:33.120
And then you apply some
kind of regression loss

00:40:33.120 --> 00:40:35.961
on each of those 14
different predicted points

00:40:35.961 --> 00:40:40.619
and just train this network
with back propagation again.

00:40:40.619 --> 00:40:43.579
Yeah, so you might see an L2
loss but people play around

00:40:43.579 --> 00:40:46.571
with other regression losses here as well.
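As a sketch, that loss is just a regression loss summed over the 14 predicted (x, y) pairs (an editor's illustration of the L2 version):

```python
import numpy as np

def pose_loss(pred, gt):
    """L2 regression loss over 14 predicted (x, y) joint positions.
    pred, gt: arrays of shape (14, 2)."""
    assert pred.shape == gt.shape == (14, 2)
    return np.sum((pred - gt) ** 2)
```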

00:40:46.571 --> 00:40:47.404
Question?

00:40:50.934 --> 00:40:52.432
So the question is what do I mean

00:40:52.432 --> 00:40:53.992
when I say regression loss?

00:40:53.992 --> 00:40:56.099
So I mean something
other than cross entropy

00:40:56.099 --> 00:40:57.294
or softmax right.

00:40:57.294 --> 00:40:59.094
When I say regression loss I usually mean

00:40:59.094 --> 00:41:02.382
like an L2 Euclidean loss or an L1 loss

00:41:02.382 --> 00:41:04.494
or sometimes a smooth L1 loss.

00:41:04.494 --> 00:41:07.512
But in general classification
versus regression

00:41:07.512 --> 00:41:10.502
is whether your output is
categorical or continuous

00:41:10.502 --> 00:41:12.643
so if you're expecting
a categorical output

00:41:12.643 --> 00:41:15.272
like you ultimately want to
make a classification decision

00:41:15.272 --> 00:41:17.243
over some fixed number of categories

00:41:17.243 --> 00:41:19.942
then you'll think about
a cross entropy loss,

00:41:19.942 --> 00:41:23.094
softmax loss or these
SVM margin type losses

00:41:23.094 --> 00:41:25.022
that we talked about already in the class.

00:41:25.022 --> 00:41:28.272
But if your expected output is
to be some continuous value,

00:41:28.272 --> 00:41:30.222
in this case the position of these points,

00:41:30.222 --> 00:41:32.174
then your output is
continuous so you tend to use

00:41:32.174 --> 00:41:34.734
different types of losses
in those situations.

00:41:34.734 --> 00:41:37.883
Typically an L2, L1, different
kinds of things there.

00:41:37.883 --> 00:41:41.482
So sorry for not clarifying that earlier.

00:41:41.482 --> 00:41:44.471
But the bigger point
here is that for any time

00:41:44.471 --> 00:41:46.832
you know that you want
to make some fixed number

00:41:46.832 --> 00:41:51.003
of outputs from your network,
if you know for example.

00:41:51.003 --> 00:41:54.344
Maybe you knew that you wanted to,

00:41:54.344 --> 00:41:56.395
you knew that you always
are going to have pictures

00:41:56.395 --> 00:41:58.763
of a cat and a dog and
you want to predict both

00:41:58.763 --> 00:42:01.392
the bounding box of the cat
and the bounding box of the dog

00:42:01.392 --> 00:42:03.062
in that case you'd know
that you have a fixed number

00:42:03.062 --> 00:42:05.304
of outputs for each input
so you might imagine

00:42:05.304 --> 00:42:07.093
hooking up this type of regression

00:42:07.093 --> 00:42:09.264
classification plus localization framework

00:42:09.264 --> 00:42:10.743
for that problem as well.

00:42:10.743 --> 00:42:13.094
So this idea of some fixed
number of regression outputs

00:42:13.094 --> 00:42:14.872
can be applied to a lot
of different problems

00:42:14.872 --> 00:42:17.039
including pose estimation.

00:42:19.062 --> 00:42:23.531
So the next task that I want to
talk about is object detection

00:42:23.531 --> 00:42:25.342
and this is a really meaty topic.

00:42:25.342 --> 00:42:27.422
This is kind of a core
problem in computer vision

00:42:27.422 --> 00:42:29.910
and you could probably
teach a whole seminar class

00:42:29.910 --> 00:42:31.868
on just the history of object detection

00:42:31.868 --> 00:42:33.902
and various techniques applied there.

00:42:33.902 --> 00:42:35.931
So I'll be relatively
brief and try to go over

00:42:35.931 --> 00:42:39.691
the main big ideas of object
detection plus deep learning

00:42:39.691 --> 00:42:42.582
that have been used in
the last couple of years.

00:42:42.582 --> 00:42:44.731
But the idea in object detection is that

00:42:44.731 --> 00:42:47.942
we again start with some
fixed set of categories

00:42:47.942 --> 00:42:52.182
that we care about, maybe cats
and dogs and fish or whatever

00:42:52.182 --> 00:42:55.321
but some fixed set of categories
that we're interested in.

00:42:55.321 --> 00:42:59.030
And now our task is that
given our input image,

00:42:59.030 --> 00:43:02.470
every time one of those
categories appears in the image,

00:43:02.470 --> 00:43:05.641
we want to draw a box around
it and we want to predict

00:43:05.641 --> 00:43:08.710
the category of that
box so this is different

00:43:08.710 --> 00:43:10.902
from classification plus localization

00:43:10.902 --> 00:43:13.620
because there might be a
varying number of outputs

00:43:13.620 --> 00:43:15.302
for every input image.

00:43:15.302 --> 00:43:17.910
You don't know ahead of time
how many objects you expect

00:43:17.910 --> 00:43:20.081
to find in each image so that's,

00:43:20.081 --> 00:43:22.870
this ends up being a
pretty challenging problem.

00:43:22.870 --> 00:43:25.630
So we've seen graphs, so
this is kind of interesting.

00:43:25.630 --> 00:43:28.988
We've seen this graph
many times of the ImageNet

00:43:28.988 --> 00:43:31.870
classification performance
as a function of years

00:43:31.870 --> 00:43:34.761
and we saw that it just got
better and better every year

00:43:34.761 --> 00:43:37.342
and there's been a similar
trend with object detection

00:43:37.342 --> 00:43:39.131
because object detection
has again been one

00:43:39.131 --> 00:43:41.291
of these core problems in computer vision

00:43:41.291 --> 00:43:44.110
that people have cared
about for a very long time.

00:43:44.110 --> 00:43:46.390
So this slide is due to Ross Girshick

00:43:46.390 --> 00:43:48.742
who's worked on this
problem a lot and it shows

00:43:48.742 --> 00:43:51.070
the progression of object
detection performance

00:43:51.070 --> 00:43:54.441
on this one particular
data set called PASCAL VOC

00:43:54.441 --> 00:43:57.230
which has been widely used
for a long time

00:43:57.230 --> 00:43:59.462
in the object detection community.

00:43:59.462 --> 00:44:02.428
And you can see that up until about 2012

00:44:02.428 --> 00:44:04.761
performance on object
detection started to stagnate

00:44:04.761 --> 00:44:08.161
and slow down a little
bit and then in 2013

00:44:08.161 --> 00:44:10.039
was when some of the first
deep learning approaches

00:44:10.039 --> 00:44:12.141
to object detection came
around and you could see

00:44:12.141 --> 00:44:13.982
that performance just shot up very quickly

00:44:13.982 --> 00:44:16.171
getting better and better year over year.

00:44:16.171 --> 00:44:21.422
One thing you might notice is that this plot ends in
2015 and it's actually continued to go up since then

00:44:21.422 --> 00:44:29.928
so the current state of the art in this data set is well over 80 and in fact a lot of recent
papers don't even report results on this data set anymore because it's considered too easy.

00:44:29.929 --> 00:44:37.421
So it's a little bit hard to know, I'm not actually sure what is the state
of the art number on this data set but it's off the top of this plot.

00:44:37.422 --> 00:44:40.924
Sorry, did you have a question?
Nevermind.

00:44:42.051 --> 00:44:50.960
Okay, so as I already said this is different from localization
because there might be differing numbers of objects for each image.

00:44:50.961 --> 00:44:57.770
So for example in this cat on the upper left there's only one object so
we only need to predict four numbers but now for this image in the middle

00:44:57.771 --> 00:45:05.551
there's three animals there so we need our network to
predict 12 numbers, four coordinates for each bounding box.

00:45:05.552 --> 00:45:13.210
Or in this example of many many ducks then you want your network to
predict a whole bunch of numbers. Again, four numbers for each duck.

00:45:13.211 --> 00:45:20.683
So object detection is
quite different from localization

00:45:20.683 --> 00:45:28.870
because in object detection you might have varying numbers of objects in
the image and you don't know ahead of time how many you expect to find.

00:45:28.870 --> 00:45:34.568
So as a result, it's kind of tricky if you want to
think of object detection as a regression problem.

00:45:34.568 --> 00:45:40.768
So instead, people tend to use kind of a different
paradigm when thinking about object detection.

00:45:40.768 --> 00:45:49.958
So one approach that's very common and has been used for a long time in
computer vision is this idea of sliding window approaches to object detection.

00:45:49.958 --> 00:45:59.360
So this is kind of similar to this idea of taking small patches and applying that
for semantic segmentation and we can apply a similar idea for object detection.

00:45:59.360 --> 00:46:05.118
So the idea is that we'll take different crops from
the input image, in this case we've got this crop

00:46:05.118 --> 00:46:10.359
in the lower left hand corner of our image and now we
take that crop, feed it through our convolutional network

00:46:10.359 --> 00:46:14.829
and our convolutional network does a
classification decision on that input crop.

00:46:14.829 --> 00:46:18.160
It'll say that there's no dog
here, there's no cat here,

00:46:18.160 --> 00:46:23.899
and then in addition to the categories that we care about
we'll add an additional category called background

00:46:23.899 --> 00:46:32.288
and now our network can predict background in case it doesn't see any
of the categories that we care about, so then when we take this crop

00:46:32.288 --> 00:46:39.008
from the lower left hand corner here then our network would hopefully
predict background and say that no, there's no object here.

00:46:39.008 --> 00:46:44.128
Now if we take a different crop then our network
would predict dog yes, cat no, background no.

00:46:44.128 --> 00:46:47.680
We take a different crop we get dog yes,
cat no, background no.

00:46:47.680 --> 00:46:54.372
Or a different crop, dog no, cat yes,
background no. Does anyone see a problem here?

00:47:00.324 --> 00:47:04.764
Yeah, the question is how do you choose the
crops? So this is a huge problem right.

00:47:04.764 --> 00:47:10.543
Because there could be any number of objects in this image,
these objects could appear at any location in the image,

00:47:10.543 --> 00:47:15.583
these objects could appear at any size in the image,
these objects could also appear at any aspect ratio

00:47:15.583 --> 00:47:29.523
in the image, so if you want to do kind of a brute force sliding window approach you'd end up having to test thousands, tens
of thousands, many many many many different crops in order to tackle this problem with a brute force sliding window approach.

00:47:29.523 --> 00:47:37.532
And in the case where every one of those crops is going to be fed through a giant
convolutional network, this would be completely computationally intractable.

00:47:37.532 --> 00:47:45.920
So in practice people don't ever do this sort of brute force sliding
window approach for object detection using convolutional networks.

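[Editor's note: to get a feel for why the brute-force sliding window blows up, here's a rough back-of-the-envelope sketch. The image size, scales, aspect ratios, and stride are all made-up illustrative values, not anything from the lecture.]

```python
# Count how many crops a brute-force sliding window would have to classify:
# every position at a small stride, over several scales and aspect ratios.
def count_windows(img_w, img_h, sizes, aspect_ratios, stride):
    total = 0
    for s in sizes:
        for ar in aspect_ratios:
            # window of area roughly s*s with aspect ratio ar
            w = int(s * ar ** 0.5)
            h = int(s / ar ** 0.5)
            if w > img_w or h > img_h:
                continue
            nx = (img_w - w) // stride + 1  # horizontal positions
            ny = (img_h - h) // stride + 1  # vertical positions
            total += nx * ny
    return total

# Even this modest setting gives on the order of 10,000 crops,
# each of which would need a full forward pass through the CNN.
n = count_windows(800, 600, sizes=[64, 128, 256],
                  aspect_ratios=[0.5, 1.0, 2.0], stride=16)
```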
00:47:47.044 --> 00:47:54.492
Instead there's this cool line of work called region
proposals, which typically doesn't use deep learning.

00:47:54.492 --> 00:47:56.332
These are slightly more
traditional computer vision

00:47:56.332 --> 00:48:05.401
techniques but the idea is that a region proposal method kind of uses more
traditional signal processing, image processing type things to make some list

00:48:05.401 --> 00:48:14.341
of proposals for where objects might be. So given an input image, a region proposal
method will then give you something like a thousand boxes where an object might be present.

00:48:14.341 --> 00:48:22.382
So you can imagine that maybe we look for edges in the image
and try to draw boxes that contain closed edges or something like that.

00:48:22.382 --> 00:48:30.132
These various types of image processing approaches, but these region proposal
methods will basically look for blobby regions in our input image and then give us

00:48:30.132 --> 00:48:38.962
some set of candidate proposal regions where objects might be
potentially found. And these are relatively fast-ish to run

00:48:38.962 --> 00:48:44.703
so one common example of a region proposal method that
you might see is something called Selective Search

00:48:44.703 --> 00:48:49.284
which I think actually gives you 2000 region
proposals, not the 1000 that it says on the slide.

00:48:49.284 --> 00:48:59.404
So you kind of run this thing and then after about two seconds of churning on your CPU it'll
spit out 2000 region proposals in the input image where objects are likely to be found

00:48:59.404 --> 00:49:05.052
so there'll be a lot of noise in those. Most of them will
not be true objects but there's a pretty high recall.

00:49:05.052 --> 00:49:11.204
If there is an object in the image then it does tend to get
covered by these region proposals from Selective Search.

00:49:11.204 --> 00:49:17.103
So now rather than applying our classification network
to every possible location and scale in the image

00:49:17.103 --> 00:49:25.164
instead what we can do is first apply one of these region proposal methods
to get some set of proposal regions where objects are likely located

00:49:25.164 --> 00:49:33.135
and now apply a convolutional network for classification to each of these
proposal regions and this will end up being much more computationally tractable

00:49:33.135 --> 00:49:36.903
than trying to do all
possible locations and scales.

00:49:36.903 --> 00:49:45.583
And this idea all came together in this paper called
R-CNN from a few years ago that does exactly that.

00:49:45.583 --> 00:49:53.263
So given our input image in this case we'll run some region proposal
network to get our proposals, these are also sometimes called

00:49:53.263 --> 00:49:56.724
regions of interest or ROI's
so again Selective Search

00:49:56.724 --> 00:49:59.692
gives you something like
2000 regions of interest.

00:49:59.692 --> 00:50:07.043
Now one of the problems here is that these regions
in the input image could have different sizes

00:50:07.043 --> 00:50:13.143
but if we're going to run them all through a convolutional
network, our convolutional networks for classification

00:50:13.143 --> 00:50:18.149
all want images of the same input size typically
due to the fully connected net layers and whatnot

00:50:18.149 --> 00:50:26.855
so we need to take each of these region proposals and warp them to that
fixed square size that is expected as input to our downstream network.

00:50:26.855 --> 00:50:34.090
So we'll crop out the regions corresponding to the
region proposals, we'll warp them to that fixed size,

00:50:34.090 --> 00:50:37.418
and then we'll run each of them
through a convolutional network

00:50:37.418 --> 00:50:48.479
which will then use in this case an SVM to make a classification decision
for each of those, to predict categories for each of those crops.

00:50:48.479 --> 00:50:52.506
And then I lost a slide.

00:50:52.506 --> 00:51:05.650
It's not shown in the slide right now, but in addition R-CNN also predicts a regression,
like a correction to the bounding box, for each of these input region proposals

00:51:05.650 --> 00:51:13.549
because the problem is that your input region proposals are kind of generally in the
right position for an object but they might not be perfect so in addition R-CNN will,

00:51:13.549 --> 00:51:24.658
in addition to category labels for each of these proposals, it'll also predict four numbers that
are kind of an offset or a correction to the box that was predicted at the region proposal stage.

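[Editor's note: as a concrete sketch of what those four numbers can mean, here's one common box-offset parameterization, shifting the center relative to the box size and scaling width/height in log-space. The exact form used by R-CNN may differ, so treat this as illustrative.]

```python
import math

# Apply a predicted 4-number correction (dx, dy, dw, dh) to a proposal box
# given as (x1, y1, x2, y2) corners.
def apply_offset(box, offset):
    x1, y1, x2, y2 = box
    dx, dy, dw, dh = offset
    w, h = x2 - x1, y2 - y1
    cx, cy = x1 + w / 2, y1 + h / 2
    # shift the center in units of the box size; scale dims via exp
    cx, cy = cx + dx * w, cy + dy * h
    w, h = w * math.exp(dw), h * math.exp(dh)
    return (cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2)
```

With a zero offset the box is unchanged; a positive dx slides it to the right by a fraction of its own width.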
00:51:24.658 --> 00:51:27.919
So then again, this is a multi-task loss
and you would train this whole thing.

00:51:27.919 --> 00:51:30.169
Sorry was there a question?

00:51:35.511 --> 00:51:39.359
The question is how much does the change
in aspect ratio impact accuracy?

00:51:40.698 --> 00:51:41.772
It's a little bit hard to say.

00:51:41.772 --> 00:51:46.551
I think there's some controlled experiments
in some of these papers but I'm not sure

00:51:46.551 --> 00:51:48.738
I can give a generic answer to that.

00:51:48.738 --> 00:51:49.571
Question?

00:51:53.602 --> 00:51:56.772
The question is is it necessary
for regions of interest to be rectangles?

00:51:56.772 --> 00:52:03.731
So they typically are because it's tough to warp
these non-rectangular regions but once you move

00:52:03.731 --> 00:52:08.911
to something like instance segmentation then you
sometimes get proposals that are not rectangles.

00:52:08.911 --> 00:52:12.071
If you actually do care about predicting
things that are not rectangles.

00:52:12.071 --> 00:52:14.238
Is there another question?

00:52:18.704 --> 00:52:24.375
Yeah, so the question is are the region proposals
learned? So in R-CNN it's a traditional thing.

00:52:24.375 --> 00:52:29.203
These are not learned, this is kind of some fixed algorithm
that someone wrote down but we'll see in a couple minutes

00:52:29.203 --> 00:52:33.466
that that's actually changed a
little bit in the last couple of years.

00:52:33.466 --> 00:52:35.633
Is there another question?

00:52:37.767 --> 00:52:40.735
The question is is the offset always
inside the region of interest?

00:52:40.735 --> 00:52:42.665
The answer is no, it doesn't have to be.

00:52:42.665 --> 00:52:50.786
You might imagine that suppose the region of interest put a box around a
person but missed the head then you could imagine the network inferring

00:52:50.786 --> 00:52:55.906
that oh this is a person but people usually have heads so
the network could say the box should be a little bit higher.

00:52:55.906 --> 00:52:59.666
So sometimes the final predicted boxes
will be outside the region of interest.

00:52:59.666 --> 00:53:00.499
Question?

00:53:08.110 --> 00:53:12.801
Yeah, the question is what happens if you have a lot
of ROIs that don't correspond to true objects?

00:53:15.877 --> 00:53:22.550
And like we said, in addition to the classes that you actually care about
you add an additional background class so your class scores can also

00:53:22.550 --> 00:53:26.289
predict background to say
that there was no object here.

00:53:26.289 --> 00:53:27.122
Question?

00:53:37.716 --> 00:53:40.894
Yeah, so the question is
what kind of data do we need

00:53:40.894 --> 00:53:53.383
and yeah, this is fully supervised in the sense that our training data consists of images, where each
image has all the object categories marked with bounding boxes for each instance of that category.

00:53:53.383 --> 00:54:02.945
There are definitely papers that try to approach this like oh what if you don't have the data.
What if you only have that data for some images? Or what if that data is noisy but at least

00:54:02.945 --> 00:54:08.568
in the generic case you assume full supervision
of all objects in the images at training time.

00:54:09.835 --> 00:54:16.535
Okay, so I think we've kind of alluded to this but there's
kind of a lot of problems with this R-CNN framework.

00:54:16.535 --> 00:54:21.644
And actually if you look at the figure here on the right you
can see that additional bounding box head so I'll put it back.

00:54:21.644 --> 00:54:25.811
But this is kind of still
computationally pretty expensive

00:54:27.436 --> 00:54:34.415
because if we've got 2000 region proposals, we're running each
of those proposals independently, that can be pretty expensive.

00:54:34.415 --> 00:54:42.895
There's also this question of relying on these fixed region
proposals; we're not learning them so that's kind of a problem.

00:54:42.895 --> 00:54:46.015
And just in practice it
ends up being pretty slow

00:54:46.015 --> 00:54:54.721
so in the original implementation R-CNN would actually dump all the features to
disk so it'd take hundreds of gigabytes of disk space to store all these features.

00:54:54.721 --> 00:54:58.472
Then training would be super slow since you have to
make all these different forward and backward passes

00:54:58.472 --> 00:55:06.134
through the image and it took something like 84 hours is one number
they've recorded for training time so this is super super slow.

00:55:06.134 --> 00:55:11.076
And now at test time it's also super slow,
something like roughly 30 seconds to a minute per image

00:55:11.076 --> 00:55:18.316
because you need to run thousands of forward passes through the convolutional
network for each of these region proposals so this ends up being pretty slow.

00:55:18.316 --> 00:55:27.404
Thankfully we have fast R-CNN that fixed a lot of these problems
so when we do fast R-CNN then it's going to look kind of the same.

00:55:27.404 --> 00:55:34.116
We're going to start with our input image but now rather than processing each
region of interest separately instead we're going to run the entire image

00:55:34.116 --> 00:55:41.924
through some convolutional layers all at once to give this high
resolution convolutional feature map corresponding to the entire image.

00:55:41.924 --> 00:55:46.652
And now we still are using some region proposals
from some fixed thing like Selective Search

00:55:46.652 --> 00:55:52.334
but rather than cropping out the pixels of the
image corresponding to the region proposals,

00:55:52.334 --> 00:56:04.745
instead we imagine projecting those region proposals onto this convolutional feature map and then taking crops from
the convolutional feature map corresponding to each proposal rather than taking crops directly from the image.

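[Editor's note: a minimal sketch of that projection step, assuming a backbone with total downsampling stride 16, as in a VGG-style network. The exact rounding policy here is my own choice; real implementations vary.]

```python
# Project an image-space proposal (x1, y1, x2, y2), in integer pixels,
# onto a feature map that is `stride` times smaller than the image.
def project_roi(box, stride=16):
    x1, y1, x2, y2 = box
    fx1, fy1 = x1 // stride, y1 // stride
    # round the far corner up (ceiling division) so the
    # projected region is never empty
    fx2 = max(fx1 + 1, -(-x2 // stride))
    fy2 = max(fy1 + 1, -(-y2 // stride))
    return (fx1, fy1, fx2, fy2)
```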
00:56:04.745 --> 00:56:13.425
And this allows us to reuse a lot of this expensive convolutional
computation across the entire image when we have many many crops per image.

00:56:13.425 --> 00:56:20.052
But again, if we have some fully connected layers downstream
those fully connected layers are expecting some fixed-size input

00:56:20.052 --> 00:56:26.131
so now we need to do some reshaping of those
crops from the convolutional feature map

00:56:26.131 --> 00:56:31.673
and they do that in a differentiable way using
something they call an ROI pooling layer.

00:56:31.673 --> 00:56:38.622
Once you have these warped crops from the convolutional
feature map then you can run these things through some

00:56:38.622 --> 00:56:45.673
fully connected layers and predict your classification scores
and your linear regression offsets to the bounding boxes.

00:56:45.673 --> 00:56:51.654
And now when we train this thing then we again have a multi-task loss
that trades off between these two constraints and during back propagation

00:56:51.654 --> 00:56:56.124
we can back prop through this entire thing
and learn it all jointly.

00:56:56.124 --> 00:57:03.575
This ROI pooling, it looks kind of like max pooling. I don't
really want to get into the details of that right now.

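[Editor's note: since the lecture skips the details, here's a toy single-channel sketch of the idea: divide the cropped feature-map region into a fixed grid of sub-windows and take the max over each, so any input size maps to the same output size. Real ROI pooling operates on multi-channel tensors and handles uneven divisions more carefully.]

```python
# feat: 2D list (H x W), one channel of the cropped feature-map region,
# with H >= out_h and W >= out_w.
def roi_max_pool(feat, out_h, out_w):
    H, W = len(feat), len(feat[0])
    out = []
    for i in range(out_h):
        row = []
        y0, y1 = i * H // out_h, (i + 1) * H // out_h
        for j in range(out_w):
            x0, x1 = j * W // out_w, (j + 1) * W // out_w
            # max over this roughly-equal sub-window
            row.append(max(feat[y][x]
                           for y in range(y0, y1)
                           for x in range(x0, x1)))
        out.append(row)
    return out
```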
00:57:03.575 --> 00:57:12.014
And in terms of speed if we look at R-CNN versus fast R-CNN versus
this other model called SPP net which is kind of in between the two,

00:57:12.014 --> 00:57:16.924
then you can see that at training time fast R-CNN
is something like 10 times faster to train

00:57:16.924 --> 00:57:20.134
because we're sharing all this computation
between different feature maps.

00:57:20.134 --> 00:57:23.272
And now at test time
fast R-CNN is super fast

00:57:23.272 --> 00:57:33.764
and in fact fast R-CNN is so fast at test time that its computation
time is actually dominated by computing region proposals.

00:57:33.764 --> 00:57:39.334
So we said that computing these 2000 region proposals
using Selective Search takes something like two seconds

00:57:39.334 --> 00:57:53.273
and now once we've got all these region proposals then because we're processing them all sort of in a shared way by sharing these
expensive convolutions across the entire image that we can process all of these region proposals in less than a second altogether.

00:57:53.273 --> 00:57:59.142
So fast R-CNN ends up being bottlenecked by
just the computing of these region proposals.

00:57:59.142 --> 00:58:03.804
Thankfully we've solved this
problem with faster R-CNN.

00:58:03.804 --> 00:58:13.734
So the idea in faster R-CNN: the problem was that computing
the region proposals using this fixed function was a bottleneck.

00:58:13.734 --> 00:58:18.054
So instead we'll just make the network
itself predict its own region proposals.

00:58:18.054 --> 00:58:30.572
And so the way that this sort of works is that again, we take our input image, run the entire input image altogether
through some convolutional layers to get some convolutional feature map representing the entire high resolution image

00:58:30.572 --> 00:58:33.204
and now there's a separate
region proposal network

00:58:33.204 --> 00:58:39.204
which works on top of those convolutional features and
predicts its own region proposals inside the network.

00:58:39.204 --> 00:58:44.542
Now once we have those predicted region
proposals then it looks just like fast R-CNN

00:58:44.542 --> 00:58:50.662
where now we take crops from those region proposals from the
convolutional features, pass them up to the rest of the network.

00:58:50.662 --> 00:58:57.094
And now we talked about multi-task losses and multi-task
training networks to do multiple things at once.

00:58:57.094 --> 00:59:05.019
Well now we're telling the network to do four things all at once
so balancing out this four-way multi-task loss is kind of tricky.

00:59:05.019 --> 00:59:14.848
The region proposal network needs to do two things: it needs to say for
each potential proposal whether it's an object or not, and it needs to actually regress

00:59:14.848 --> 00:59:18.186
the bounding box coordinates
for each of those proposals,

00:59:18.186 --> 00:59:21.787
and now the final network at the end
needs to do these two things again.

00:59:21.787 --> 00:59:26.288
Make final classification decisions for what are
the class scores for each of these proposals,

00:59:26.288 --> 00:59:34.086
and also have a second round of bounding box regression to again
correct any errors that may have come from the region proposal stage.

00:59:34.086 --> 00:59:34.919
Question?

00:59:45.231 --> 00:59:50.703
So the question is that sometimes multi-task learning might be
seen as regularization and are we getting that effect here?

00:59:50.703 --> 00:59:52.602
I'm not sure if there's been
super controlled studies

00:59:52.602 --> 01:00:01.162
on that but actually in the original version of the faster R-CNN
paper they did a little bit of experimentation like what if we share

01:00:01.162 --> 01:00:03.951
the region proposal network,
what if we don't share?

01:00:03.951 --> 01:00:08.522
What if we learn separate convolutional networks for the
region proposal network versus the classification network?

01:00:08.522 --> 01:00:12.970
And I think there were minor differences but
it wasn't a dramatic difference either way.

01:00:12.970 --> 01:00:18.380
So in practice it's kind of nicer to only learn
one because it's computationally cheaper.

01:00:18.380 --> 01:00:19.713
Sorry, question?

01:00:33.583 --> 01:00:41.903
Yeah the question is how do you train this region proposal network because you don't
know, you don't have ground truth region proposals for the region proposal network.

01:00:41.903 --> 01:00:45.172
So that's a little bit hairy. I don't
want to get too much into those details

01:00:45.172 --> 01:00:53.452
but the idea is that at any time you have a region proposal which has
more than some threshold of overlap with any of the ground truth objects

01:00:53.452 --> 01:00:57.771
then you say that that is a positive region proposal
and the network should predict it as an object

01:00:57.771 --> 01:01:04.471
and any potential proposal which has very low overlap with
any ground truth objects should be predicted as a negative.

01:01:04.471 --> 01:01:09.550
But there's a lot of dark magic hyperparameters
in that process and that's a little bit hairy.

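[Editor's note: to make the overlap-based labeling just described concrete, here's a minimal sketch using intersection-over-union (IoU) with the commonly cited 0.7/0.3 thresholds. Treat the details, including the thresholds, as illustrative.]

```python
# Boxes are (x1, y1, x2, y2) corners.
def iou(box_a, box_b):
    x1 = max(box_a[0], box_b[0]); y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2]); y2 = min(box_a[3], box_b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

def label_anchor(anchor, gt_boxes, pos_thresh=0.7, neg_thresh=0.3):
    best = max(iou(anchor, gt) for gt in gt_boxes)
    if best >= pos_thresh:
        return 1    # positive: train to predict "object"
    if best < neg_thresh:
        return 0    # negative: train to predict "not object"
    return -1       # in between: ignored in the loss
```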
01:01:09.550 --> 01:01:10.383
Question?

01:01:15.394 --> 01:01:19.793
Yeah, so the question is what is the classification
loss on the region proposal network and the answer is

01:01:19.793 --> 01:01:26.648
that it's making binary decisions, so I didn't want to get into too much of the
details of that architecture 'cause it's a little bit hairy.

01:01:26.648 --> 01:01:32.269
So it has some set of potential regions that it's
considering and it's making a binary decision for each one.

01:01:32.269 --> 01:01:34.078
Is this an object or not an object?

01:01:34.078 --> 01:01:37.578
So it's like a binary classification loss.

01:01:38.520 --> 01:01:43.658
So once you train this thing then faster
R-CNN ends up being pretty darn fast.

01:01:43.658 --> 01:01:48.706
So now because we've eliminated this overhead from
computing region proposals outside the network,

01:01:48.706 --> 01:01:53.588
now faster R-CNN ends up being very very
fast compared to these other alternatives.

01:01:53.588 --> 01:01:59.388
Also, one interesting thing is that because we're
learning the region proposals here you might imagine

01:01:59.388 --> 01:02:05.086
maybe what if there was some mismatch between
this fixed region proposal algorithm and my data?

01:02:05.086 --> 01:02:16.320
So in this case once you're learning your own region proposals then you can overcome that
mismatch if your region proposals are somewhat weird or different than other data sets.

01:02:16.320 --> 01:02:22.914
So this whole family of R-CNN methods, R stands
for region, so these are all region-based methods

01:02:22.914 --> 01:02:30.716
because there's some kind of region proposal and then we're doing some
processing, some independent processing for each of those potential regions.

01:02:30.716 --> 01:02:36.708
So this whole family of methods are called these
region-based methods for object detection.

01:02:36.708 --> 01:02:40.676
But there's another family of methods that
you sometimes see for object detection

01:02:40.676 --> 01:02:43.818
which is sort of all feed
forward in a single pass.

01:02:43.818 --> 01:02:48.076
So one of these is YOLO
for You Only Look Once.

01:02:48.076 --> 01:02:50.796
And another is SSD for
Single Shot Detection

01:02:50.796 --> 01:02:54.067
and these two came out
somewhat around the same time.

01:02:54.067 --> 01:03:02.348
But the idea is that rather than doing independent processing for each of these potential
regions instead we want to try to treat this like a regression problem and just make

01:03:02.348 --> 01:03:06.156
all these predictions all at once
with some big convolutional network.

01:03:06.156 --> 01:03:13.468
So now given our input image you imagine dividing that input image
into some coarse grid, in this case it's a seven by seven grid

01:03:13.468 --> 01:03:18.556
and now within each of those grid cells you
imagine some set of base bounding boxes.

01:03:18.556 --> 01:03:25.748
Here I've drawn three base bounding boxes like a tall one, a wide
one, and a square one but in practice you would use more than three.

01:03:25.748 --> 01:03:32.858
So now for each of these grid cells and for each of these
base bounding boxes you want to predict several things.

01:03:32.858 --> 01:03:41.868
One, you want to predict an offset from the base bounding box
giving the true location of the object relative to that base box.

01:03:43.020 --> 01:03:51.460
And you also want to predict classification scores so maybe a
classification score for each of these base bounding boxes.

01:03:51.460 --> 01:03:55.503
How likely is it that an object of this
category appears in this bounding box.

01:03:55.503 --> 01:04:03.929
So then at the end, from our input image we end up
predicting this giant tensor of size 7 x 7 x (5B + C).

01:04:04.951 --> 01:04:12.700
So that's just where we have B base bounding boxes, we have five numbers
for each giving our offset and our confidence for that base bounding box

01:04:12.700 --> 01:04:16.340
and C classification scores
for our C categories.

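[Editor's note: the shape arithmetic is simple enough to write down. With B base boxes per grid cell, each carrying 4 offset numbers plus 1 confidence, and C class scores per cell, a sketch:]

```python
# Output tensor shape for a YOLO-style single-shot detector over an
# S x S grid: each cell predicts B boxes (4 offsets + 1 confidence each)
# plus C class scores.
def detection_output_shape(S=7, B=3, C=20):
    return (S, S, 5 * B + C)

shape = detection_output_shape(S=7, B=3, C=20)  # (7, 7, 35)
```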
01:04:16.340 --> 01:04:23.522
So then we kind of see object detection as this input
of an image, output of this three dimensional tensor

01:04:23.522 --> 01:04:27.722
and you can imagine just training this whole
thing with a giant convolutional network.

01:04:27.722 --> 01:04:30.682
And that's kind of what
these single shot methods do

01:04:30.682 --> 01:04:41.180
where, again, matching the ground truth objects to these potential
base boxes becomes a little bit hairy, but that's what these methods do.

01:04:41.180 --> 01:04:48.539
And by the way, the region proposal network that gets used in faster
R-CNN ends up looking quite similar to these where they have some set

01:04:48.539 --> 01:04:55.279
of base bounding boxes over some gridded image, and the region
proposal network does some regression plus some classification.

01:04:55.279 --> 01:04:59.196
So there's kind of some
overlapping ideas here.

01:05:00.388 --> 01:05:13.892
So in faster R-CNN we're kind of treating the region proposal step as kind of this end-to-end
regression problem and then we do the separate per region processing but now with these single shot methods

01:05:13.892 --> 01:05:19.761
we only do that first step and just do all of our
object detection with a single forward pass.

01:05:19.761 --> 01:05:21.740
So object detection has a
ton of different variables.

01:05:21.740 --> 01:05:23.950
There could be different
base networks like VGG,

01:05:23.950 --> 01:05:29.601
ResNet, we've seen different meta-strategies for
object detection including this faster R-CNN

01:05:29.601 --> 01:05:31.820
type region based family of methods,

01:05:31.820 --> 01:05:34.060
this single shot detection
family of methods.

01:05:34.060 --> 01:05:38.153
There's kind of a hybrid that I didn't talk about
called R-FCN which is somewhat in between.

01:05:38.153 --> 01:05:39.580
There's a lot of different hyperparameters

01:05:39.580 --> 01:05:43.590
like what is the image size,
how many region proposals do you use.

01:05:43.590 --> 01:05:48.022
And there's actually this really cool paper that
will appear at CVPR this summer that does a really

01:05:48.022 --> 01:05:56.353
controlled experimentation around a lot of these different variables and tries
to tell you how do these methods all perform under these different variables.

01:05:56.353 --> 01:05:58.676
So if you're interested I'd
encourage you to check it out

01:05:58.676 --> 01:06:06.702
but kind of one of the key takeaways is that the faster R-CNN style of
region based methods tends to give higher accuracies but ends up being

01:06:06.702 --> 01:06:08.972
much slower than the single shot methods

01:06:08.972 --> 01:06:12.486
because the single shot methods don't
require this per region processing.

01:06:12.486 --> 01:06:17.204
But I encourage you to check out
this paper if you want more details.

01:06:17.204 --> 01:06:24.621
Also as a bit of aside, I had this fun paper with Andre a couple years
ago that kind of combined object detection with image captioning

01:06:24.621 --> 01:06:27.273
and did this problem
called dense captioning

01:06:27.273 --> 01:06:32.472
so now the idea is that rather than predicting
a fixed category label for each region,

01:06:32.472 --> 01:06:35.084
instead we want to write
a caption for each region.

01:06:35.084 --> 01:06:41.033
And again, we had a data set with this sort of
data, regions together with captions,

01:06:41.033 --> 01:06:46.153
and then we sort of trained this giant end-to-end
model that just predicted these captions all jointly.

01:06:46.153 --> 01:06:50.962
And this ends up looking somewhat like faster
R-CNN where you have some region proposal stage

01:06:50.962 --> 01:06:53.764
then a bounding box, then
some per region processing.

01:06:53.764 --> 01:07:03.454
But rather than an SVM or a softmax loss, that per region processing
has a whole RNN language model that predicts a caption for each region.

01:07:03.454 --> 01:07:06.814
So that ends up looking quite
a bit like faster R-CNN.

01:07:06.814 --> 01:07:11.524
There's a video here but I think
we're running out of time so I'll skip it.

01:07:11.524 --> 01:07:17.897
But the idea here is that once you have this, you
can kind of tie together a lot of these ideas

01:07:17.897 --> 01:07:21.508
and if you have some new problem that you're
interested in tackling like dense captioning,

01:07:21.508 --> 01:07:26.860
you can recycle a lot of the components that you've learned
from other problems like object detection and image captioning

01:07:26.860 --> 01:07:32.565
and kind of stitch together one end-to-end network that
produces the outputs that you care about for your problem.

01:07:32.565 --> 01:07:36.567
So the last task that I want to talk about
is this idea of instance segmentation.

01:07:36.567 --> 01:07:40.636
So here instance segmentation is
in some ways like the full problem.

01:07:40.636 --> 01:07:50.594
We're given an input image and we want to predict the locations and identities
of objects in that image, similar to object detection, but rather than just

01:07:50.594 --> 01:07:55.385
predicting a bounding box for each of those objects,
instead we want to predict a whole segmentation mask

01:07:55.385 --> 01:08:02.785
for each of those objects and predict which pixels in
the input image correspond to each object instance.

01:08:02.785 --> 01:08:07.484
So this is kind of like a hybrid between
semantic segmentation and object detection

01:08:07.484 --> 01:08:15.196
because like object detection we can handle multiple objects and we
differentiate the identities of different instances so in this example

01:08:15.196 --> 01:08:21.924
since there are two dogs in the image, an instance segmentation
method actually distinguishes between the two dog instances

01:08:21.924 --> 01:08:32.765
in the output. And kind of like semantic segmentation, we have this pixel-wise accuracy
where for each of these objects we want to say which pixels belong to that object.

01:08:32.765 --> 01:08:38.247
So there have been a lot of different methods that people
have used to tackle instance segmentation as well,

01:08:38.247 --> 01:08:49.868
but the current state of the art is this new paper called Mask R-CNN that actually just came
out on arXiv about a month ago, so this is not yet published, this is like super fresh stuff.

01:08:49.868 --> 01:08:52.675
And this ends up looking
a lot like faster R-CNN.

01:08:52.676 --> 01:08:55.296
So it has this multi-stage
processing approach

01:08:55.296 --> 01:09:05.622
where we take our whole input image, that whole input image goes into some convolutional
network and some learned region proposal network that's exactly the same as faster R-CNN

01:09:05.622 --> 01:09:14.795
and now once we have our learned region proposals then we project those proposals
onto our convolutional feature map just like we did in fast and faster R-CNN.

01:09:14.796 --> 01:09:21.228
But now rather than just making a classification and a bounding-box
regression decision for each of those boxes, we in addition

01:09:21.229 --> 01:09:27.478
want to predict a segmentation mask for each of those
bounding boxes, for each of those region proposals.

01:09:27.478 --> 01:09:36.888
So now it kind of looks like a mini, like a semantic segmentation problem inside
each of the region proposals that we're getting from our region proposal network.

01:09:36.889 --> 01:09:45.947
So now after we do this RoI align to warp our features corresponding to the
region proposal into the right shape, then we have two different branches.

01:09:45.948 --> 01:09:53.750
One branch comes off the top, and this first branch looks
just like faster R-CNN and it will predict classification scores

01:09:53.750 --> 01:09:59.318
telling us what is the category corresponding to that region
proposal, or alternatively whether or not it's background.

01:09:59.318 --> 01:10:04.596
And we'll also predict some bounding box coordinates
that are regressed off the region proposal coordinates.

01:10:04.596 --> 01:10:13.550
And now in addition we'll have this branch at the bottom which looks basically
like a semantic segmentation mini network which will classify for each pixel

01:10:13.550 --> 01:10:17.780
in that input region proposal
whether or not it's an object
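
[Editor's note: a toy sketch of that per-region mask branch, collapsing channels with a single made-up weight vector instead of the paper's small conv network:]

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical: a 14x14 RoI-aligned feature map with 4 channels for one proposal.
roi_feat = rng.normal(size=(4, 14, 14))
w = rng.normal(size=(4,)) * 0.5  # toy 1x1 "conv" producing one mask logit per pixel

def mask_head(feat, weights):
    """Per-pixel binary classification inside one region proposal."""
    logits = np.tensordot(weights, feat, axes=1)  # (14, 14) mask logits
    probs = 1.0 / (1.0 + np.exp(-logits))         # sigmoid per pixel
    return probs > 0.5                            # foreground vs. background

mask = mask_head(roi_feat, w)
print(mask.shape)  # (14, 14)
```
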

01:10:17.780 --> 01:10:29.230
so this mask R-CNN problem, this mask R-CNN architecture just kind of unifies all of these different
problems that we've been talking about today into one nice jointly end-to-end trainable model.
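
[Editor's note: a crude, runnable sketch of the multi-stage flow just described. Every component here is a stand-in with made-up shapes, and the "RoI align" is a nearest-neighbor crop, not the real bilinear operation:]

```python
import numpy as np

rng = np.random.default_rng(2)

def backbone(image):                  # stand-in for the shared conv network
    return rng.normal(size=(8, 32, 32))       # C x H x W feature map

def propose_regions(feats, n=3):      # stand-in for the learned RPN
    return [(4, 4, 18, 18)] * n               # (x0, y0, x1, y1) boxes

def roi_align(feats, box, out=7):     # crude crop + resize, not true RoI Align
    x0, y0, x1, y1 = box
    crop = feats[:, y0:y1, x0:x1]
    ys = np.linspace(0, crop.shape[1] - 1, out).astype(int)
    xs = np.linspace(0, crop.shape[2] - 1, out).astype(int)
    return crop[:, ys][:, :, xs]              # C x 7 x 7 warped features

def heads(roi):
    pooled = roi.mean(axis=(1, 2))
    cls_scores = pooled[:3]           # toy class scores
    box_deltas = pooled[:4]           # toy box regression
    mask = roi.mean(axis=0) > 0       # toy per-pixel mask inside the region
    return cls_scores, box_deltas, mask

image = rng.normal(size=(3, 128, 128))
feats = backbone(image)
outputs = [heads(roi_align(feats, b)) for b in propose_regions(feats)]
print(len(outputs), outputs[0][2].shape)  # 3 (7, 7)
```

Each proposal gets classification, box regression, and a mask from one forward pass, which is the structural idea behind the architecture.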

01:10:29.230 --> 01:10:36.710
And it's really cool and it actually works really really well so
when you look at the examples in the paper they're kind of amazing.

01:10:36.710 --> 01:10:39.078
They look kind of indistinguishable
from ground truth.

01:10:39.078 --> 01:10:49.497
So in this example on the left you can see that there are these two people standing in front of motorcycles, it's drawn
the boxes around these people, it's also gone in and labeled all the pixels of those people and it's really small

01:10:49.497 --> 01:10:54.961
but actually in the background on that image on the left there's
also a whole crowd of people standing very small in the background.

01:10:54.961 --> 01:10:58.628
It's also drawn boxes around each of those and
grabbed the pixels of each of those people.

01:10:58.628 --> 01:11:08.028
And you can see that this is just, it ends up working really really well and
it's a relatively simple addition on top of the existing faster R-CNN framework.

01:11:08.028 --> 01:11:15.108
So I told you that mask R-CNN unifies everything we talked
about today and it also does pose estimation by the way.

01:11:15.108 --> 01:11:22.257
So we talked about, you can do pose estimation by predicting
these joint coordinates for each of the joints of the person

01:11:22.257 --> 01:11:29.388
so you can use mask R-CNN to do joint object detection,
pose estimation, and instance segmentation.

01:11:29.388 --> 01:11:35.246
And the only addition we need to make is that for each of
these region proposals we add an additional little branch

01:11:35.246 --> 01:11:42.628
that predicts the coordinates of the joints for
the instance in the current region proposal.

01:11:42.628 --> 01:11:51.715
So now this is just another loss, like another layer that we add, another
head coming out of the network and an additional term in our multi-task loss.
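
[Editor's note: the loss values here are invented; the sketch just shows that the keypoint head contributes one more scalar term to the same multi-task sum:]

```python
# Toy per-region loss terms (made-up numbers).
losses = {"cls": 1.0, "box": 0.5, "mask": 0.25}

# Adding the pose-estimation head just adds one more term.
losses["keypoint"] = 0.25

total = sum(losses.values())
print(total)  # 2.0
```
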

01:11:51.715 --> 01:11:59.406
But once we add this one little branch then you can do all of these
different problems jointly and you get results looking something like this.

01:11:59.406 --> 01:12:02.705
Where now this network, like
a single feed forward network

01:12:02.705 --> 01:12:09.792
is deciding how many people are in the image, detecting where
those people are, figuring out the pixels corresponding to each

01:12:09.792 --> 01:12:22.742
of those people and also drawing a skeleton estimating the pose of those people and this works really well even in crowded scenes
like this classroom where there's a ton of people sitting and they all overlap each other and it just seems to work incredibly well.

01:12:22.742 --> 01:12:28.291
And because it's built on the faster R-CNN framework
it also runs relatively close to real time

01:12:28.291 --> 01:12:36.061
so this is running something like five frames per second on a GPU because
this is all sort of done in the single forward pass of the network.

01:12:36.061 --> 01:12:42.833
So this is again, a super new paper but I think that this
will probably get a lot of attention in the coming months.

01:12:42.833 --> 01:12:45.430
So just to recap, we've talked.

01:12:45.430 --> 01:12:46.680
Sorry question?

01:12:53.800 --> 01:12:55.781
The question is how much
training data do you need?

01:12:55.781 --> 01:13:00.948
So all of these instance segmentation results
were trained on the Microsoft COCO data set

01:13:00.948 --> 01:13:08.320
so Microsoft Coco is roughly 200,000 training
images. It has 80 categories that it cares about

01:13:08.320 --> 01:13:14.010
so in each of those 200,000 training images it has
all the instances of those 80 categories labeled.

01:13:14.010 --> 01:13:23.285
So there's something like 200,000 images for training and there's something like, I think,
an average of five or six instances per image. So it actually is quite a lot of data.
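
[Editor's note: a quick back-of-envelope using the figures quoted in the lecture, not exact COCO statistics:]

```python
# Roughly 200,000 training images, about 5-6 labeled instances each.
images = 200_000
low, high = images * 5, images * 6
print(low, high)  # 1000000 1200000
```

So on the order of a million labeled object instances of supervision.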

01:13:23.285 --> 01:13:32.000
And for Microsoft Coco for all the people in Microsoft Coco they also have
all the joints annotated as well so this actually does have quite a lot

01:13:32.000 --> 01:13:36.669
of supervision at training time you're right, and
actually is trained with quite a lot of data.

01:13:36.669 --> 01:13:42.050
So I think one really interesting topic to
study moving forward is that we kind of know

01:13:42.050 --> 01:13:50.701
that if you have a lot of data to solve some problem, at this point we're relatively confident that
you can stitch up some convolutional network that can probably do a reasonable job at that problem

01:13:50.701 --> 01:13:59.069
but figuring out ways to get performance like this with less training data is a super
interesting and active area of research and I think that's something people will be spending

01:13:59.069 --> 01:14:03.301
a lot of their efforts working
on in the next few years.

01:14:03.301 --> 01:14:08.068
So just to recap, today we had kind of a whirlwind tour
of a whole bunch of different computer vision topics

01:14:08.068 --> 01:14:15.925
and we saw how a lot of the machinery that we built up from image classification
can be applied relatively easily to tackle these different computer vision topics.

01:14:15.925 --> 01:14:22.835
And next time we'll talk about, we'll have a
really fun lecture on visualizing CNN features.